
Basics of Natural Language Processing

Last Updated: 30th June, 2023

Rohan Roney

AlmaBetter Student at almaBetter

NLP refers to the artificial intelligence method of communicating with an intelligent system using natural language.

Human beings are the most advanced species on Earth, and much of our success comes from the ability to communicate and share information. That is precisely where language comes into the picture. Human language is one of the most diverse and complex parts of us, with close to 6,500 languages in total.

Another well-known point is that in the 21st century, according to industry insights, only 21% of available data is present in a structured form. Data is being generated as speech, audio, tweets, and messages; that is, the majority of it is natural language and therefore highly unstructured. To produce significant and actionable insights from this data, it is important to get acquainted with the techniques of text analysis and natural language processing (NLP).

Introduction

Text mining, or text analysis, is the process of deriving meaningful information from natural language text. It usually involves structuring the input text, deriving patterns from the structured data, and finally interpreting and evaluating the output. NLP, on the other hand, refers to the artificial intelligence method of communicating with an intelligent system using natural language. Since text mining is about deriving high-quality information from text, the overall goal is essentially to turn text into data that can be analysed, and this is done by applying NLP. That is why text mining and NLP go hand in hand.

Natural language processing (NLP) sits at the intersection of computer science, linguistics, and machine learning. The field focuses on communication between computers and humans in natural language; in short, NLP is about making computers understand and generate human language.

Applications of NLP

One of the first and most important applications of NLP is sentiment analysis, be it on Twitter, Facebook, or any other social media platform.

Chatbots are another very important application. They are common on company websites, where they automate certain tasks and answer frequently asked questions.

Another application of NLP is speech recognition, which also includes voice assistants such as Siri and Cortana.

Machine translation is another use case. One of the best-known examples is Google Translate, which translates from one language to another, in some cases in real time.

Other applications include spell checking, keyword search, information extraction, and advertisement matching.

Components of NLP

NLP can be divided into 2 major components:

Natural Language Understanding: This is the process of mapping the given natural language input into a useful representation and analysing different aspects of the language.

Natural Language Generation: This is the process of generating meaningful phrases and sentences in natural language from some internal representation.

The understanding part is generally considered much more difficult than the generation part, because it takes a lot of effort to capture a particular language and its intricacies. Going forward, let's look at the steps involved in NLP.

Various Steps for NLP

Tokenisation

It's the first step of NLP. Tokenisation is the process of breaking a string into tokens, which are small units such as words and punctuation marks. E.g. “This is NLP” can be broken down into 3 tokens, namely This, is, and NLP.
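To make this concrete, here is a minimal sketch of a tokeniser written with Python's standard `re` module. The function name `tokenize` and the regular expression are assumptions chosen for illustration; tokenisers in libraries handle many more cases (contractions, URLs, abbreviations and so on).

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores (a word);
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("This is NLP"))    # ['This', 'is', 'NLP']
print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Note that punctuation marks come out as separate tokens, which is usually what later steps expect.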

Stemming

It means normalising words into their base or root form. E.g. affects, affecting etc. come from the base word “affect”. A stemming algorithm generally works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.
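The suffix-cutting idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a standard stemming algorithm such as Porter's: the suffix list, the minimum stem length of three characters, and the name `simple_stem` are all assumptions, and prefixes are not handled.

```python
# Hypothetical suffix list, checked in this order (longer suffixes first)
SUFFIXES = ("ing", "ed", "es", "s")

def simple_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no suffix matched: the word is returned unchanged

for w in ["affects", "affecting", "affected", "affect", "studies"]:
    print(w, "->", simple_stem(w))
```

The first four words all reduce to "affect", while "studies" becomes "studi". That last result shows a typical side effect of stemming: the stem is not always a real word.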

Lemmatization

Lemmatization takes the morphological analysis of the word into account. It needs a detailed dictionary that the algorithm can look through to link an inflected form back to its root word, called the ‘lemma’. Like stemming, it maps several words onto one common root. Unlike stemming, however, the output is always a proper word, which is not always the case with stemming.
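The dictionary lookup described above can be sketched as follows. The table `LEMMA_DICT` is a tiny hand-written stand-in for the detailed dictionary a real lemmatizer relies on, and the function name `lemmatize` is likewise an assumption made for this example.

```python
# Tiny hand-written lookup table (illustrative only); a real lemmatizer
# uses a full dictionary and usually the word's part of speech as well
LEMMA_DICT = {
    "affects": "affect",
    "affecting": "affect",
    "studies": "study",
    "studying": "study",
    "went": "go",
    "mice": "mouse",
}

def lemmatize(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)

for w in ["studies", "went", "mice", "affecting"]:
    print(w, "->", lemmatize(w))
```

Here "studies" maps to "study" and "went" maps to "go", both proper words. Irregular forms like "went" are exactly the cases that suffix stripping cannot handle.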

POS Tags

Once we have the root forms of the tokens, the next step is POS tagging. The grammatical category of a token is called its part of speech, or POS tag, e.g. verb, noun, article, adjective etc. It indicates how a word functions, both in meaning and grammatically, within the sentence.
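As a rough illustration, the simplest possible tagger just looks each token up in a word list. The lexicon, the tag names (DET, ADJ, NOUN, VERB, ADP), the default tag, and the function name `lookup_pos_tag` are all assumptions made for this sketch. Real taggers also use the surrounding words, since the same word can take different tags in different sentences.

```python
# Hypothetical mini-lexicon mapping lowercase words to POS tags
LEXICON = {
    "the": "DET", "a": "DET",
    "lazy": "ADJ", "quick": "ADJ",
    "dog": "NOUN", "cat": "NOUN", "mat": "NOUN",
    "sat": "VERB", "chased": "VERB",
    "on": "ADP",
}

def lookup_pos_tag(tokens, default="NOUN"):
    """Tag each token by lexicon lookup; unknown words get the default tag."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

print(lookup_pos_tag(["The", "lazy", "dog", "sat", "on", "the", "mat"]))
```

The result is a list of (token, tag) pairs, which is the usual input format for later steps such as chunking.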

Named Entity Recognition

It is the process of detecting named entities, i.e. person names, company names, locations etc. It has 3 primary steps: noun phrase identification, phrase classification, and entity disambiguation.
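The first two of these steps can be sketched with a regular expression and a lookup table (a "gazetteer"). Everything here is made up for illustration: the names in `GAZETTEER`, the entity labels, and the function `find_entities`. Disambiguation is left out, and the capitalisation heuristic will miss or mis-group many real cases, for example when an ordinary capitalised word at the start of a sentence comes right before a name.

```python
import re

# Hypothetical gazetteer of known entities and their types
GAZETTEER = {
    "Asha Rao": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Mumbai": "LOCATION",
}

def find_entities(text):
    """Return (phrase, label) pairs for capitalised phrases found in the gazetteer."""
    # Step 1: candidate phrases are runs of capitalised words
    candidates = re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)
    # Step 2: classify each candidate by looking it up; unknown phrases are dropped
    return [(c, GAZETTEER[c]) for c in candidates if c in GAZETTEER]

print(find_entities("Asha Rao joined Acme Corp in Mumbai."))
# [('Asha Rao', 'PERSON'), ('Acme Corp', 'ORGANIZATION'), ('Mumbai', 'LOCATION')]
```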

Chunking

This basically means picking up individual pieces of information, such as tagged tokens, and “grouping” them into bigger pieces called chunks, for example noun phrases. This helps in getting insights and meaningful information from the text.
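A minimal sketch of noun phrase chunking, assuming the tokens have already been POS-tagged: consecutive determiners, adjectives and nouns are grouped together, and a group is kept only if it contains a noun. The tag names, the hard-coded input, and the function `chunk_noun_phrases` are assumptions for this example. A grammar-based chunker would be stricter about the order of tags inside a chunk.

```python
NP_TAGS = {"DET", "ADJ", "NOUN"}  # tags allowed inside a noun phrase chunk

def chunk_noun_phrases(tagged):
    """Group consecutive DET/ADJ/NOUN tokens into noun phrase chunks."""
    chunks, current = [], []
    # The sentinel pair at the end forces the last group to be flushed
    for word, tag in list(tagged) + [("", None)]:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        # A non-NP tag closes the current group; keep it only if it has a noun
        if any(t == "NOUN" for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks

tagged = [("The", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
          ("chased", "VERB"), ("a", "DET"), ("cat", "NOUN")]
print(chunk_noun_phrases(tagged))  # ['The lazy dog', 'a cat']
```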

All of these steps can be performed using the Natural Language Toolkit (NLTK) Python library, which is also used for various other text analysis and NLP tasks.

Major Challenges in NLP

Natural Language Processing still has a number of limitations and open problems:

Contextual words and phrases and homonyms
Synonyms
Irony and sarcasm
Ambiguity
Errors in text or speech
Colloquialisms and slang
Domain-specific language
Low-resource languages
Lack of research and development

Although NLP still faces tough challenges, the discipline is developing faster than ever before, and in the coming years we are likely to reach a level of advancement that makes today's complex applications look achievable.