Introduction to NLP Techniques

What is Natural Language?

  • From a data perspective, natural language refers to speech and text.
  • By processing, we mean converting natural language (text and speech) into a form that computers can process and understand.

What is NLP?

  • NLP stands for Natural Language Processing.
  • It is the branch of Artificial Intelligence that deals with the interaction of machines with human language.
  • The ultimate objective of NLP is to read, understand, and make sense of human languages in a manner that is valuable.

Use Cases of NLP

NLP has many use cases in our day-to-day life, including:

  • Recommender systems
  • Spam detection
  • Search Engines like Google and Yahoo
  • Chatbots
  • Grammarly
  • News feeds (Facebook, Google, Instagram, etc.)
  • Autocorrect and autocomplete
  • Sentiment Analysis
  • Google Translate

Text Preprocessing

Machine learning models need data in numeric form, so textual data must first be cleaned. This process of preparing (or cleaning) text data before encoding it is called text preprocessing.

Cleaning

  • The first step involves cleaning all the documents we receive.
  • A document may contain punctuation, URLs, and other characters that add nothing to our model, so they need to be removed.
  • Cleaning therefore involves removing URLs, punctuation, and any other undesirable characters.
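The cleaning step above can be sketched with Python's built-in `re` module; the exact patterns (and the `clean_text` helper name) are illustrative choices, not a fixed recipe:

```python
import re

def clean_text(doc: str) -> str:
    """Remove URLs, punctuation, and extra whitespace from a document."""
    doc = re.sub(r"https?://\S+|www\.\S+", "", doc)  # strip URLs
    doc = re.sub(r"[^\w\s]", "", doc)                # strip punctuation
    doc = re.sub(r"\s+", " ", doc).strip()           # collapse extra whitespace
    return doc

print(clean_text("Check this out: https://example.com !!"))  # → Check this out
```

Real pipelines often add more rules (HTML tags, emojis, digits) depending on the data source.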

Lowercase Conversion

  • A document may contain the same word multiple times, sometimes starting with an uppercase letter and sometimes with a lowercase one.
  • If left untreated, the computer treats these as different words.
  • So we convert all the words in the document to lowercase.
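A quick illustration of why this matters: the same word in different cases counts as distinct strings until we normalize it.

```python
words = ["Language", "language", "LANGUAGE"]
print(len(set(words)))      # 3 distinct strings before conversion

lowered = [w.lower() for w in words]
print(len(set(lowered)))    # 1 after conversion
```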

Removing Stopwords

Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document. Generally, the most common words used in a text are “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at” etc.
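One way to filter stopwords is with scikit-learn's built-in English stop-word list (NLTK's `stopwords` corpus is another common choice; the list used here is an implementation detail, not the only option):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokens = ["the", "cat", "is", "in", "the", "garden"]
# keep only tokens that are not in the stop-word list
filtered = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(filtered)  # → ['cat', 'garden']
```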

Tokenization

The next step is tokenization (also called segmentation): splitting the document into smaller units called tokens, typically words or sentences.
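Word-level tokenization can be sketched with a simple regular expression; libraries such as NLTK (`word_tokenize`) or spaCy provide far more robust tokenizers that handle edge cases this toy pattern does not:

```python
import re

def tokenize(doc: str) -> list[str]:
    """Split a document into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", doc)

print(tokenize("Don't stop believing."))  # → ["Don't", 'stop', 'believing']
```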

Stemming

Stemming is a technique used to extract the base form of words by removing affixes, just like cutting a tree's branches back to its stem. Note that the resulting stem may not be a valid dictionary word.
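A crude suffix-stripping stemmer illustrates the idea; this toy `simple_stem` function is purely for demonstration — real pipelines use a proper algorithm such as NLTK's `PorterStemmer`:

```python
def simple_stem(word: str) -> str:
    """Toy stemmer: strip a common suffix if enough of the word remains."""
    for suffix in ("ing", "ly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("branches"))  # → branch
print(simple_stem("quickly"))   # → quick
```

Notice that a real stemmer can also produce non-words (Porter stems "studies" to "studi"), which is the trade-off stemming makes for speed and simplicity.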

Lemmatization

Lemmatization is the process of converting a word to its base dictionary form (its lemma). It is similar to stemming, but it uses a vocabulary and morphological analysis, so the output is always a valid word.
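Because lemmatization is lexicon-driven, it can be sketched as a dictionary lookup; the tiny `LEMMAS` table here is a made-up illustration — in practice the lexicon is something like WordNet, accessed through NLTK's `WordNetLemmatizer`:

```python
# Toy lemma dictionary for illustration only; real lemmatizers consult a full
# lexicon (e.g. WordNet) plus the word's part of speech.
LEMMAS = {"ran": "run", "feet": "foot", "better": "good", "was": "be"}

def lemmatize(word: str) -> str:
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word, word)

print(lemmatize("feet"))  # → foot
```

Unlike stemming, "feet" maps to the real word "foot" rather than a truncated string.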

Count Vectorizer

CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the entire text.

TF-IDF Vectorizer

TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is the product of two terms:

  • The first term is the Term Frequency (TF): the number of times a term appears in a document, divided by the total number of terms in that document.
  • The second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

And we get an output in the form of a sparse matrix.

Bhanu Shahi

Data Analyst at Decimal Tech | Machine Learning | NLP | Time Series | Python, Tableau & SQL Expert | Storyteller | Blogger