Introduction to NLP
Sumanta Muduli
Data Scientist at Flutura Decision Sciences & Analytics at almaBetter
Computers are great at working with structured data like spreadsheets and database tables, but humans usually communicate in words and sentences, which computers cannot interpret directly. To solve this problem, NLP uses techniques that convert language into useful numerical representations, that is, mathematically interpretable objects, so that we can feed them into machine learning algorithms according to our requirements.
Machine learning needs data in numeric form, so we first need to clean the textual data. This process of preparing (or cleaning) text data before encoding it is called text preprocessing, and it is the very first step in solving any NLP problem. Libraries such as spaCy and NLTK make these preprocessing tasks easier.
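All the snippets below assume the reviews are already loaded into a pandas DataFrame named dataset with a Review column. A minimal sketch of that setup (the file name reviews.csv is only a placeholder) could look like this:
import pandas as pd

# Placeholder file: any CSV with a text column named 'Review' will do here.
dataset = pd.read_csv('reviews.csv')
print(dataset['Review'].head())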
Cleaning
1. Removing URLs-
We import the re library and use a regular expression to strip URLs from the text.
import re

def clean_url(text):
    # remove anything that starts with http (or https) up to the next whitespace
    return re.sub(r'http\S+', '', text)

dataset['Clean_text'] = dataset['Review'].apply(clean_url)
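For example, applied to a single made-up review, clean_url drops everything that starts with http:
sample = 'Great phone! Full review at https://example.com/review/123'
print(clean_url(sample))  # -> 'Great phone! Full review at '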
2. Removing punctuations-
Punctuation is basically the set of symbols [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]:
def clean_punctuations(text):
    # keep only letters; every other character (punctuation and digits) becomes a space
    return re.sub('[^a-zA-Z]', ' ', text)

dataset['Clean_text'] = dataset['Clean_text'].apply(clean_punctuations)
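Note that the [^a-zA-Z] pattern keeps only letters, so digits are dropped along with the punctuation; a quick check on a made-up string:
print(clean_punctuations('5 stars!!! Worth every $ of the 1,299 price.'))
# digits and symbols are replaced by spaces, leaving only the words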
3. Converting all to lower case
def clean_lower(text):
    return str(text).lower()

dataset['Clean_text'] = dataset['Clean_text'].apply(clean_lower)
4. Removing stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

def clean_stopwords(text):
    # split on whitespace and keep only the words that are not stopwords
    return ' '.join([word for word in str(text).split() if word not in stop])

dataset['Clean_text'] = dataset['Clean_text'].apply(clean_stopwords)
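As a quick sanity check on a made-up (already lowercased) sentence, the common function words are dropped while the content words survive:
print(clean_stopwords('the battery of this phone is very good'))
# -> 'battery phone good'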
5. Tokenization-
It's a method of splitting a string into smaller units called tokens. A token could be a punctuation mark, a word, a mathematical symbol, a number, etc.
nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models
from nltk.tokenize import word_tokenize

def clean_tokenization(text):
    return word_tokenize(text)

dataset['Clean_text'] = dataset['Clean_text'].apply(clean_tokenization)
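On raw text, word_tokenize also treats punctuation marks and contraction parts as separate tokens, which is exactly the behaviour described above; for example:
print(word_tokenize("This phone isn't bad, honestly!"))
# -> ['This', 'phone', 'is', "n't", 'bad', ',', 'honestly', '!']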
6. Stemming and Lemmatization-
Stemming chops a word down to a crude root form (which may not be a real word), while lemmatization maps it to its dictionary base form (lemma).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def Clean_stem(token):
    return [stemmer.stem(i) for i in token]

dataset['Clean_text'] = dataset['Clean_text'].apply(Clean_stem)
nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()

def clean_lemmatization(token):
    # treat each token as a verb (pos='v') and map it to its lemma
    return [lemma.lemmatize(word=w, pos='v') for w in token]

dataset['Clean_text'] = dataset['Clean_text'].apply(clean_lemmatization)
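To see the difference between the two, here is a small side-by-side comparison on a few made-up tokens, reusing the stemmer and lemma objects defined above:
words = ['studies', 'running', 'was']
print([stemmer.stem(w) for w in words])              # -> ['studi', 'run', 'wa']
print([lemma.lemmatize(w, pos='v') for w in words])  # -> ['study', 'run', 'be']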
7. Removing small words having length ≤2
Even after all the steps above, some noise remains in the corpus, so we also drop words that are very short (length ≤ 2).
def Clean_length(token):
    return [i for i in token if len(i) > 2]

dataset['Clean_text'] = dataset['Clean_text'].apply(Clean_length)
8. Converting the list back into a string
def convert_to_string(list1):
    return ' '.join(list1)

dataset['Clean_text'] = dataset['Clean_text'].apply(convert_to_string)
Now we are all set to vectorize our text. There are two common ways to do it:
1. CountVectorizer (Bag of Words): It converts the corpus into a matrix of token counts. Here max_df=0.9 drops words that appear in more than 90% of the reviews, and min_df=10 drops words that appear in fewer than 10 reviews.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_df=0.9, min_df=10)
X = vectorizer.fit_transform(dataset['Clean_text'])
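X is a sparse matrix with one row per review and one column per vocabulary word; a quick way to inspect it (assuming a recent scikit-learn version that provides get_feature_names_out):
print(X.shape)                                  # (number of reviews, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])  # first ten words kept in the vocabulary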
2. TF-IDF: In TF-IDF we transform the count matrix into a normalized tf (term-frequency) or tf-idf (term-frequency times inverse document-frequency) representation. The formula used to compute the tf-idf for a term t of a document d in a set of n documents, as scikit-learn implements it by default (smooth_idf=True), is:
tf-idf(t, d) = tf(t, d) * idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1
and df(t) is the number of documents that contain t. Each row is then normalized to unit length.
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(max_df=0.9, min_df=10)
X = tf.fit_transform(dataset['Clean_text'])
Note-
In CountVectorizer we only count the number of times a word appears in a document, which biases the representation towards the most frequent words and ends up under-weighting rarer words that could help us process our data more effectively.
To overcome this, we use TfidfVectorizer.
In TfidfVectorizer we consider the overall weightage of a word across the whole document collection. It helps us deal with the most frequent words by penalizing them: TfidfVectorizer scales the raw word counts by a measure of how many documents they appear in.
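A tiny, made-up corpus makes the difference visible: 'phone' appears in every document, so TfidfVectorizer gives it a lower weight than the rarer, more informative words, while CountVectorizer treats all counts alike.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['good phone', 'bad phone', 'phone camera amazing']

cv = CountVectorizer()
print(pd.DataFrame(cv.fit_transform(corpus).toarray(),
                   columns=cv.get_feature_names_out()))

tv = TfidfVectorizer()
print(pd.DataFrame(tv.fit_transform(corpus).toarray().round(2),
                   columns=tv.get_feature_names_out()))
# 'phone' occurs in all three documents, so its tf-idf weight in each row is
# lower than the weight of the words that occur in only one document.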
That's all folks, have a nice day!