A special kind of beauty exists which is born in language, of language, and for language.
Data Scientists work with tons of data, and often that data includes natural language such as text and speech. That text is usually quite similar to the language we use in our day-to-day lives, and it has to be converted into a machine-readable form before a model can use it.
In this blog, we are going to see some common NLP techniques, with the help of which we can begin performing analysis and building models from textual data.
What is Natural Language?
- From a data perspective, natural language refers to speech and text.
- By processing in NLP, we mean transforming our natural language (text and speech) into a form that computers can process and understand.
So, let’s start with a formal definition…
What is NLP?
- NLP stands for Natural Language Processing.
- It is the branch of Artificial Intelligence that deals with the interaction of machines with human language.
- The ultimate objective of NLP is to read, understand, and make sense of human languages in a manner that is valuable.
Use Cases of NLP
There are various use cases of NLP in our day-to-day life. These include:
- Voice Assistants (Alexa, Siri, Cortana, etc.)
- Recommender systems
- Spam detection
- Search Engines like Google and Yahoo
- News Feed (Facebook, Google, Instagram, etc.)
- Autocorrect and autocomplete
- Sentiment Analysis
- Google Translate
Computers are great at working with structured data like spreadsheets and database tables, but we humans usually communicate in words, not in tables, and computers cannot understand those words directly. To solve this problem, NLP uses techniques that convert language into useful information such as numbers or other mathematically interpretable objects, so that we can feed it into ML algorithms according to our requirements.
Machine Learning needs data in numeric form, so we first need to clean the textual data. This process of preparing (or cleaning) text data before encoding is called text preprocessing.
This is the very first step in solving NLP problems. A few libraries, such as spaCy and NLTK, exist to deal with textual data and make preprocessing easier.
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces. Also, it contains a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Best of all, NLTK is a free, open-source, community-driven project.
Text preprocessing involves a few steps that we follow once we get our raw data.
Cleaning
- The first step is cleaning the documents we get.
- A document may contain punctuation and URLs that have nothing to do with our model, so they need to be removed.
- Cleaning involves removing URLs, punctuation, and any other undesirable characters.
Lowercase Conversion
- A document may contain the same word multiple times, sometimes capitalized and sometimes not.
- If left untreated, the computer treats those as different words.
- So we convert all the words in the document to lowercase.
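The cleaning and lowercasing steps above can be sketched with plain Python and the `re` module (the regex patterns and the example sentence are illustrative, not from the original post):

```python
import re

def clean_text(text):
    """Remove URLs and punctuation, then lowercase the text."""
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    return text.lower()                       # lowercase everything

doc = "Check out https://example.com -- NLP is Fun, fun, FUN!"
print(clean_text(doc).split())  # ['check', 'out', 'nlp', 'is', 'fun', 'fun', 'fun']
```

After cleaning, "Fun", "fun", and "FUN" all collapse into a single word, which is exactly what the lowercase step is for.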
Stopword Removal
Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document. Generally, the most common words in a text are “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at”, etc.
So we remove these stopwords from each document. We can easily do so using the NLTK library.
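A minimal sketch of stopword removal, using a small hand-picked stopword set for illustration; in practice you would use NLTK's fuller list via `nltk.corpus.stopwords.words("english")` after running `nltk.download("stopwords")`:

```python
# Small hand-picked stopword set (illustrative); NLTK's stopwords corpus
# provides a much longer list for many languages.
STOPWORDS = {"the", "is", "in", "for", "where", "when", "to", "at"}

def remove_stopwords(words):
    """Keep only the words that are not stopwords."""
    return [w for w in words if w not in STOPWORDS]

print(remove_stopwords(["the", "cat", "is", "in", "the", "garden"]))  # ['cat', 'garden']
```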
Tokenization
The next step is tokenization (also called segmentation) of the document.
Tokenization is essentially splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
Tokens are the building blocks of Natural Language.
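A word-level tokenizer can be sketched with a single regular expression (a simplification; in practice NLTK's `nltk.word_tokenize` is commonly used, which requires downloading its `punkt` tokenizer models first):

```python
import re

def tokenize(text):
    """Split text into word tokens; \\w+ matches runs of letters, digits, underscores."""
    return re.findall(r"\w+", text.lower())

print(tokenize("Tokens are the building blocks of Natural Language."))
# ['tokens', 'are', 'the', 'building', 'blocks', 'of', 'natural', 'language']
```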
Stemming
Stemming is a technique used to extract the base form of a word by removing affixes from it. It is like cutting the branches of a tree down to its stem.
For example, the stem of the words eating, eats, and eaten is eat.
Search engines use stemming for indexing words: rather than storing all forms of a word, a search engine can store only the stems. This reduces the size of the index and improves recall.
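Stemming can be tried out with NLTK's `PorterStemmer`, which needs no corpus download (this assumes NLTK is installed; the word list is illustrative):

```python
from nltk.stem import PorterStemmer  # rule-based stemmer, no corpus download needed

stemmer = PorterStemmer()
for word in ["eating", "eats", "studies"]:
    print(word, "->", stemmer.stem(word))
# Note: "studies" stems to "studi", a non-word -- a known limitation of stemming,
# and the Porter rules leave some forms like "eaten" unchanged.
```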
Lemmatization
Lemmatization is the process of converting a word to its base form. It is similar to stemming.
The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often producing incorrect meanings and spelling errors.
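A minimal sketch of the idea, using a tiny hand-written lookup table purely to contrast with stemming; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults a full dictionary (it requires the `wordnet` corpus, fetched with `nltk.download("wordnet")`):

```python
# Tiny illustrative lemma table; real systems use a dictionary such as WordNet.
LEMMAS = {"ate": "eat", "eaten": "eat", "better": "good", "studies": "study"}

def lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get(word, word)

print(lemmatize("studies"))  # 'study' -- a real word, unlike the stem 'studi'
print(lemmatize("better"))   # 'good'  -- a mapping no character-chopping stemmer can make
```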
Now, the question is: can we just give our algorithm a bunch of text data and expect anything to happen? Unfortunately, no, we can’t.
Algorithms have a hard time understanding text data, so we need to transform it into something the model can understand. Computers are exceptionally good at working with numbers, so if we represent the text in each document as a vector of numbers, our algorithm will be able to understand it and proceed accordingly. What we will be doing is transforming the text in the body of each document into a vector of numbers using a vectorizer.
Once we are done with the data cleaning, tokenization, and stemming/lemmatization we have to vectorize our document. Let’s do that.
CountVectorizer
CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.
CountVectorizer creates a matrix in which each unique word is represented by a column, and each text sample from the document is a row. The value of each cell is simply the count of that word in that particular text sample. Because most cells are zero, the result is stored as a sparse matrix.
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
The TF-IDF weight is a product of two terms:
- The first term is the normalized Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document;
- The second term is the Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
And we get an output in the form of a sparse matrix.
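The two formulas above can be computed directly in plain Python (the toy documents are illustrative; note that scikit-learn's `TfidfVectorizer` uses a slightly smoothed variant of IDF, so its numbers differ from this textbook definition):

```python
import math

# Toy pre-tokenized documents (illustrative).
docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]

def tf(term, doc):
    # Times the term appears in the document / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log_e(total documents / documents containing the term).
    return math.log(len(docs) / sum(term in d for d in docs))

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("the", docs[0], docs))  # 0.0 -- "the" appears in every document
print(tfidf("cat", docs[0], docs))  # positive -- "cat" is distinctive of document 0
```

A term that appears in every document gets IDF = log(1) = 0, so its TF-IDF weight vanishes, which is exactly why common words carry little weight in this scheme.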
That’s all about some basic NLP techniques!