TF-IDF/Term Frequency Technique: Easiest explanation for Text classification in NLP with Python (2024)

OR How to find meaning of sentences and documents

Rohit Madan

Published in

Analytics Vidhya

Answer me this —

Imagine there’s a document full of sentences. What is the best way to break it up so that a machine can make some sense of what it says?
1. Break it in words
2. Break it in letters
3. Break it in sentences
4. Break it in bytes

Can you answer it ?

The correct answer is option 3: break it into sentences.

Why? Because when you break a document into multiple sentences, each sentence has multiple words which provide some context to that sentence, and these sentences as a whole provide some context to the document. Then we can ask the machine questions like,

what documents are similar to each other Siri?

By evaluating TF-IDF, a score that weighs “the words used in a sentence vs the words used in the overall set of documents”, we understand:

how useful a word is to a sentence (which helps us understand the importance of a word in a sentence).
how useful a word is to a document (which helps us find the important, more frequent words in a document).
how to ignore words that are misspelled (using the n-gram technique), an example of which I am covering below.

Imagine that in a document you misspelled ‘example’ as ‘exaple’ and you forgot to go back and change it before giving it to a machine to read.

In the case of BOW, ‘example’ and ‘exaple’ would be treated as different words and, where their frequency is the same, given the same importance.

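In the case of TF-IDF they are still two different words, and the IDF score alone does not correct the mistake: a misspelling that shows up in only one document actually gets a high IDF. What helps is that the misspelling is rare, so it can be dropped with a minimum document-frequency cutoff or matched to ‘example’ with character n-grams, and we then treat ‘exaple’ like a non-useful word.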

Now, because of these scores, our machine has a better understanding of these documents. It can be asked to compare documents, find similar documents, find opposite documents and find similarities within a document, and it can use the scores to recommend what you should read next. Cool, right?

Now, I am guessing you need a minute to go back and grasp this concept again before I tell you how to do it. Of course I’ll take up an example, so if you’re conceptually hazy but almost clear, you’ll definitely be alright once you practise with the example.

The process to find the meaning of documents using TF-IDF is very similar to Bag of Words:

1. Clean data / preprocessing: clean (standardise) the data, normalize it (all lower case) and lemmatize it (reduce all words to root words)
2. Tokenize words with frequency
3. Find TF for words
4. Find IDF for words
5. Vectorize vocab

(If you’re unfamiliar with what these are, I recommend reading the BOW article I shared at the top to get a clear understanding of how to do them.)

I’ll be using these techniques to cover the example below so I hope you’re familiar with them.
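As a quick refresher, the first two steps can be sketched in plain Python. This is only a minimal sketch: the `preprocess` helper is a hypothetical name of mine, and lemmatization is left out because it would need an extra library such as NLTK or spaCy.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

# Step 1: clean and normalize; step 2: tokenize and count frequencies.
tokens = preprocess("It is going to rain today.")
frequencies = Counter(tokens)

print(tokens)
print(frequencies)
```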

Let’s cover an example of 3 documents -

Document 1 It is going to rain today.

Document 2 Today I am not going outside.

Document 3 I am going to watch the season premiere.

To find TF-IDF we need to perform the steps we laid out above, let’s get to it.


Document 1—

It is going to rain today.

Find its TF:

TF = (Number of repetitions of a word in a document) / (Number of words in the document)
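The TF formula can be tried out on Document 1 with a few lines of Python. This is a minimal sketch that keeps every word (no stop-word removal) and strips only the trailing full stop.

```python
from collections import Counter

document1 = "It is going to rain today."

# Lowercase, drop the trailing full stop and split on whitespace.
words = document1.lower().rstrip(".").split()
counts = Counter(words)

# TF = occurrences of the word / total number of words in the document.
tf = {word: count / len(words) for word, count in counts.items()}

for word, score in tf.items():
    print(f"{word}: {score:.3f}")
```

Every word occurs once in a six-word sentence, so each TF comes out as 1/6.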


Continue for rest of sentences -


Find IDF for the words (we do this only for the feature names, i.e. the vocab words that are left once stop words are removed):

IDF = Log[(Number of documents) / (Number of documents containing the word)]
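Here is a minimal sketch of the IDF formula applied to the three documents. Two assumptions of mine: the log is taken in base 10 (the formula does not say which base), and no stop words are removed, so every word ends up in the vocab.

```python
import math

documents = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]

# Lowercase, drop the trailing full stop and split on whitespace.
tokenized = [doc.lower().rstrip(".").split() for doc in documents]
vocab = sorted(set(word for tokens in tokenized for word in tokens))
n_docs = len(tokenized)

# IDF = log(number of documents / number of documents containing the word).
idf = {}
for word in vocab:
    docs_with_word = sum(1 for tokens in tokenized if word in tokens)
    idf[word] = math.log10(n_docs / docs_with_word)

for word, score in idf.items():
    print(f"{word}: {score:.3f}")
```

A word that appears in all three documents, such as ‘going’, gets an IDF of 0, while a word that appears in only one document gets the highest value.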


From the resulting TF-IDF scores you can easily see that words like ‘it’, ‘is’ and ‘rain’ are important for Document 1 but not for Documents 2 and 3, which means Document 1 differs from Documents 2 and 3 when it comes to talking about rain.

You can also say that Document 1 and 2 talk about something ‘today’, and document 2 and 3 discuss something about the writer because of the word ‘I’.

These scores help you find similarities and differences between documents and words much better than BOW does.
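If you want to rebuild the scores yourself, multiplying TF by IDF for every word and document gives the full table. This is a minimal sketch under the same assumptions as the hand formulas (base-10 log, no stop-word removal), so the numbers will not match scikit-learn's output further down.

```python
import math
from collections import Counter

documents = {
    "Document 1": "It is going to rain today.",
    "Document 2": "Today I am not going outside.",
    "Document 3": "I am going to watch the season premiere.",
}

def tokenize(text):
    """Lowercase, drop the trailing full stop and split on whitespace."""
    return text.lower().rstrip(".").split()

tokenized = {name: tokenize(text) for name, text in documents.items()}
vocab = sorted({word for tokens in tokenized.values() for word in tokens})
n_docs = len(tokenized)

# IDF per word: log10(number of documents / documents containing the word).
idf = {
    word: math.log10(n_docs / sum(word in tokens for tokens in tokenized.values()))
    for word in vocab
}

# TF-IDF per document: (count / document length) * IDF.
tfidf = {}
for name, tokens in tokenized.items():
    counts = Counter(tokens)
    tfidf[name] = {word: counts[word] / len(tokens) * idf[word] for word in vocab}

# Print one row per word and one column per document.
print(f"{'word':<10}" + "".join(f"{name:>12}" for name in tfidf))
for word in vocab:
    print(f"{word:<10}" + "".join(f"{tfidf[name][word]:>12.3f}" for name in tfidf))
```

Words that occur in every document, like ‘going’, score 0 everywhere, which is exactly why they tell you nothing about how the documents differ.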

If you want to see a video of the example I picked, check out the video.

The challenge is to use these sentences and find the words which give them meaning using TF-IDF, OK?

Let’s begin

#Part 1 Declaring all documents and assigning to a Vocab document

Document1 = "It is going to rain today."
Document2 = "Today I am not going outside."
Document3 = "I am going to watch the season premiere."
Doc = [Document1,
       Document2,
       Document3]
print(Doc)
Output>>>
['It is going to rain today.', 'Today I am not going outside.', 'I am going to watch the season premiere.']

#Part 2 - Initializing TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

Simple, right? See how easy it is to deploy TF-IDF.

#Part 3 - Getting feature names of final words that we will use to tag documents

analyze = vectorizer.build_analyzer()
print('Document 1', analyze(Document1))
print('Document 2', analyze(Document2))
print('Document 3', analyze(Document3))
# Fit the vectorizer first so the TF-IDF matrix X exists before printing it
X = vectorizer.fit_transform(Doc)
print('Document transform', X.toarray())
Output>>>
Document 1 ['it', 'is', 'going', 'to', 'rain', 'today']
Document 2 ['today', 'am', 'not', 'going', 'outside']
Document 3 ['am', 'going', 'to', 'watch', 'the', 'season', 'premiere']
Document transform [[0. 0.27824521 0.4711101 0.4711101 0. 0. 0. 0.4711101 0. 0. 0.35829137 0.35829137 0. ]
 [0.40619178 0.31544415 0. 0. 0.53409337 0.53409337 0. 0. 0. 0. 0. 0.40619178 0. ]
 [0.32412354 0.25171084 0. 0. 0. 0. 0.4261835 0. 0.4261835 0.4261835 0.32412354 0. 0.4261835 ]]

See how each sentence is broken into words and each word is represented as a number for the machine; I’ve shown both above.

#Part 4 - Vectorizing or creating a matrix of all three documents and finding feature names

X = vectorizer.fit_transform(Doc)
# get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
Output>>>
['am', 'going', 'is', 'it', 'not', 'outside', 'premiere', 'rain', 'season', 'the', 'to', 'today', 'watch']

The output is the vocabulary: the words which add context to the 3 sentences. These are the words the machine scores across all 3 sentences, and now you can ask it questions of whatever nature you like, stuff like:

What are similar documents?
When will it rain ?
I am done, what to read next ?

Because the machine has a score to help with these questions, TF-IDF is also a great tool for training a machine to answer back, as in chatbots.
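For the first question, "What are similar documents?", one common approach is to compare the TF-IDF rows with cosine similarity. This is a minimal sketch assuming scikit-learn; cosine similarity is my choice of measure here, not something the steps above prescribe.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]

# One TF-IDF row per document.
X = TfidfVectorizer().fit_transform(docs)

# similarity[i, j] is the cosine similarity between document i and document j.
similarity = cosine_similarity(X)
print(similarity.round(3))
```

Each document scores 1 against itself, and with these three sentences Document 1 comes out closer to Document 2 (they share ‘going’ and ‘today’) than to Document 3.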

If you would like to view the full code -

Go check out my GitHub here > Check Bag of words code.