TF-IDF — Term Frequency-Inverse Document Frequency

Author: Fatih Karabiber
Ph.D. in Computer Engineering, Data Scientist

What is TF-IDF?

Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).

Words within a text document are transformed into importance numbers by a text vectorization process. There are many different text vectorization scoring schemes, with TF-IDF being one of the most common.

As its name implies, TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) by its Inverse Document Frequency (IDF).

Term Frequency: The TF of a term or word is the number of times the term appears in a document divided by the total number of words in the document.

$$ TF = \frac {\textrm{number of times the term appears in the document} }{ \textrm{total number of terms in the document}} $$

Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).

$$ IDF = \log \left( \frac{\textrm{number of documents in the corpus}}{\textrm{number of documents in the corpus that contain the term}} \right) $$

The TF-IDF of a term is calculated by multiplying TF and IDF scores.

$$ \textit{TF-IDF} = TF * IDF $$

Translated into plain English: the importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document, measured by TF, is balanced by rarity between documents, measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the corpus.

TF-IDF is useful in many natural language processing applications. For example, search engines use TF-IDF to rank the relevance of a document for a query. TF-IDF is also employed in text classification, text summarization, and topic modeling.
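
As an illustration of the search use case, here is a minimal ranking sketch. It assumes scikit-learn's TfidfVectorizer (introduced later in this article) and cosine similarity as the relevance score; the toy corpus is the same one used in the examples below, and the query is invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['data science is one of the most important fields of science',
        'this is one of the best data science courses',
        'data scientists analyze data']

# Vectorize the documents, then score them against a query
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(['data science courses'])

print(cosine_similarity(query_vector, doc_vectors))
# The second document scores highest: it shares the rarest query terms.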

Note that there are different approaches to calculating the IDF score. The base-10 logarithm is often used in the calculation. However, some libraries use a natural logarithm. In addition, one can be added to the denominator as follows in order to avoid division by zero.

$$ IDF = \log \left( \frac{\textrm{number of documents in the corpus}}{\textrm{number of documents in the corpus that contain the term} + 1} \right) $$
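
A quick sketch of this variant (the helper name smoothed_idf is illustrative, and the base-10 logarithm is assumed, as above):

import math

def smoothed_idf(n_docs, doc_freq):
    # +1 in the denominator avoids division by zero for unseen terms
    return math.log10(n_docs / (doc_freq + 1))

print(smoothed_idf(10_000, 100))  # ~1.9957, slightly below the unsmoothed value of 2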

Numerical Example

Imagine the term $t$ appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of $t$ can be calculated as follows:

$$ TF= \frac{20}{100} = 0.2 $$

Assume a collection of related documents contains 10,000 documents. If 100 of the 10,000 documents contain the term $t$, the Inverse Document Frequency (IDF) of $t$ can be calculated as follows:

$$ IDF = \log \frac{10000}{100} = 2 $$

Using these two quantities, we can calculate the TF-IDF score of the term $t$ for the document:

$$ \textit{TF-IDF} = 0.2 * 2 = 0.4 $$
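
The same arithmetic in a few lines of Python, as a sanity check (base-10 logarithm assumed, matching the formulas above):

import math

tf = 20 / 100                    # the term appears 20 times in a 100-word document
idf = math.log10(10_000 / 100)   # 100 of the 10,000 documents contain the term
print(tf * idf)                  # 0.4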

Python Implementation

Some popular Python libraries have a function to calculate TF-IDF. For example, the machine learning library scikit-learn provides the TfidfVectorizer class (docs).

We will write a TF-IDF function from scratch using the standard formula given above, but we will not apply any preprocessing operations such as stop word removal, stemming, punctuation removal, or lowercasing. It should be noted that the results may therefore differ from those produced by a library's built-in function.
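
For reference, a minimal sketch of the kind of preprocessing we are deliberately skipping (lowercasing and punctuation removal only; stop-word removal and stemming would need extra tooling such as NLTK, and the preprocess function name is illustrative):

import string

def preprocess(doc):
    # Lowercase and strip punctuation; intentionally NOT applied in this article
    return doc.lower().translate(str.maketrans('', '', string.punctuation))

print(preprocess('Data Science, in 2024!'))  # 'data science in 2024'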

import pandas as pd
import numpy as np

First, let's construct a small corpus.

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

Next, we'll create a word set for the corpus:

words_set = set()

for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))

print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)

Out:

Number of words in the corpus: 14
The words in the corpus: 
 {'important', 'scientists', 'best', 'courses', 'this', 'analyze', 'of', 'most', 'the', 'is', 'science', 'fields', 'one', 'data'}

Computing Term Frequency

Now we can create a dataframe with one row per document in the corpus and one column per word in the word set, and use it to compute the term frequency (TF):

n_docs = len(corpus)          # Number of documents in the corpus
n_words_set = len(words_set)  # Number of unique words in the corpus

df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ')  # Words in the document
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))

df_tf

Out:

   important  scientists      best   courses      this  analyze        of      most       the        is   science    fields       one      data
0   0.090909        0.00  0.000000  0.000000  0.000000     0.00  0.181818  0.090909  0.090909  0.090909  0.181818  0.090909  0.090909  0.090909
1   0.000000        0.00  0.111111  0.111111  0.111111     0.00  0.111111  0.000000  0.111111  0.111111  0.111111  0.000000  0.111111  0.111111
2   0.000000        0.25  0.000000  0.000000  0.000000     0.25  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.500000

The dataframe above has a column for each word and a row for each document, showing the relative frequency of each word in each document.

Computing Inverse Document Frequency

Now, we'll compute the inverse document frequency (IDF):

print("IDF of: ")idf = {}for w in words_set: k = 0 # number of documents in the corpus that contain this word for i in range(n_docs): if w in corpus[i].split(): k += 1 idf[w] = np.log10(n_docs / k) print(f'{w:>15}: {idf[w]:>10}' )
Learn Data Science with

Out:

IDF of: 
      important: 0.47712125471966244
     scientists: 0.47712125471966244
           best: 0.47712125471966244
        courses: 0.47712125471966244
           this: 0.47712125471966244
        analyze: 0.47712125471966244
             of: 0.17609125905568124
           most: 0.47712125471966244
            the: 0.17609125905568124
             is: 0.17609125905568124
        science: 0.17609125905568124
         fields: 0.47712125471966244
            one: 0.17609125905568124
           data: 0.0

Putting it Together: Computing TF-IDF

Since we have TF and IDF now, we can compute TF-IDF:

df_tf_idf = df_tf.copy()

for w in words_set:
    for i in range(n_docs):
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]

df_tf_idf

Out:

   important  scientists      best   courses      this  analyze        of      most       the        is   science    fields       one  data
0   0.043375     0.00000  0.000000  0.000000  0.000000  0.00000  0.032017  0.043375  0.016008  0.016008  0.032017  0.043375  0.016008   0.0
1   0.000000     0.00000  0.053013  0.053013  0.053013  0.00000  0.019566  0.000000  0.019566  0.019566  0.019566  0.000000  0.019566   0.0
2   0.000000     0.11928  0.000000  0.000000  0.000000  0.11928  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.0

Notice that "data" has an IDF of 0 because it appears in every document. As a result, is not considered to be an important term in this corpus. This will change slightly in the following sklearn implementation, where "data" will be non-zero.

TF-IDF Using scikit-learn

First, we need to import sklearn's TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

We need to instantiate the class first, then we can call the fit_transform method on our test corpus. This will perform all of the calculations we performed above.

tr_idf_model = TfidfVectorizer()
tf_idf_vector = tr_idf_model.fit_transform(corpus)

Vectorizing the corpus this way yields a sparse matrix.

Here's the current shape of the matrix:

print(type(tf_idf_vector), tf_idf_vector.shape)

Out:

<class 'scipy.sparse.csr.csr_matrix'> (3, 14)

And we can convert it to a regular array to get a better idea of the values:

tf_idf_array = tf_idf_vector.toarray()
print(tf_idf_array)

Out:

[[0.         0.         0.         0.18952581 0.32089509 0.32089509
  0.24404899 0.32089509 0.48809797 0.24404899 0.48809797 0.
  0.24404899 0.        ]
 [0.         0.40029393 0.40029393 0.23642005 0.         0.
  0.30443385 0.         0.30443385 0.30443385 0.30443385 0.
  0.30443385 0.40029393]
 [0.54270061 0.         0.         0.64105545 0.         0.
  0.         0.         0.         0.         0.         0.54270061
  0.         0.        ]]

It's now very straightforward to obtain the original terms in the corpus by using get_feature_names_out:

words_set = tr_idf_model.get_feature_names_out()
print(words_set)

Out:

['analyze' 'best' 'courses' 'data' 'fields' 'important' 'is' 'most' 'of'
 'one' 'science' 'scientists' 'the' 'this']

Finally, we'll create a dataframe to better show the TF-IDF scores of each document:

df_tf_idf = pd.DataFrame(tf_idf_array, columns=words_set)
df_tf_idf

Out:

    analyze      best   courses      data    fields  important        is      most        of       one   science  scientists       the      this
0  0.000000  0.000000  0.000000  0.189526  0.320895   0.320895  0.244049  0.320895  0.488098  0.244049  0.488098    0.000000  0.244049  0.000000
1  0.000000  0.400294  0.400294  0.236420  0.000000   0.000000  0.304434  0.000000  0.304434  0.304434  0.304434    0.000000  0.304434  0.400294
2  0.542701  0.000000  0.000000  0.641055  0.000000   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.542701  0.000000  0.000000

As you can see from the output above, these TF-IDF scores differ from the scores we obtained through the manual process earlier. The difference is due to sklearn's implementation of TF-IDF, which uses a slightly different formula. For more details, see sklearn's documentation on TF-IDF term weighting.
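
To make the connection concrete, here is a sketch that reproduces sklearn's numbers for the third document by hand, using its default settings (raw term counts for TF, smoothed natural-log IDF, and L2 normalization of each row):

import numpy as np

n_docs = 3
# Term counts and document frequencies for "data scientists analyze data"
counts = {'data': 2, 'scientists': 1, 'analyze': 1}
doc_freq = {'data': 3, 'scientists': 1, 'analyze': 1}

# Unnormalized scores: raw count * smoothed IDF
raw = {w: c * (np.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
       for w, c in counts.items()}

# L2-normalize so the document vector has unit length
norm = np.sqrt(sum(v ** 2 for v in raw.values()))
print({w: v / norm for w, v in raw.items()})
# {'data': 0.641..., 'scientists': 0.542..., 'analyze': 0.542...} -- matches the last row above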

FAQs

How do you calculate term frequency and inverse document frequency?

Term frequency can be determined by counting the number of occurrences of a term in a document. IDF is calculated as the logarithm of the total number of documents divided by the number of documents in the collection containing the term. It is useful for reducing the weight of terms that are common within a collection of documents.

What is inverse document frequency (IDF)?

Inverse Document Frequency (IDF) is a weight indicating how commonly a word is used. The more frequent its usage across documents, the lower its score. The lower the score, the less important the word becomes.

What is term frequency times inverse document frequency?

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

What is term frequency-inverse document frequency in NLP?

Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).

What is the TF-IDF formula?

There are variations on the TF-IDF formula, but the most widely used version multiplies TF by IDF. In plain English: a word's frequency in a given document times the log of the total number of documents over the number of documents containing the word.

What is the difference between term frequency and inverse document frequency in information retrieval?

The main difference between term frequency (TF) and inverse document frequency (IDF) lies in their purpose and calculation. TF measures the frequency of a term within a single document, while IDF measures the rarity of a term across a collection of documents.

What is the difference between BoW and TF-IDF?

BoW creates a series of vectors containing the counts of word occurrences in a document, whereas TF-IDF additionally distinguishes the more important words from the less important ones. BoW vectors are also easy to interpret.

What is the formula for IDF, and which log base is used?

IDF is defined according to the formula: IDF(t) = log( N / d(t) ), where N is the number of documents in the collection and d(t) is the number of documents in the collection where the term t occurs. Note that some sources employ a base-2 logarithm here.

What does inverse document frequency (IDF) measure?

IDF measures the rarity of a term across a collection of documents. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. The goal is to penalize words that are common across all documents.

What is an example of inverse document frequency (IDF)?

Term frequency is how common a word is; inverse document frequency (IDF) is how unique or rare a word is. For example, consider a document containing 100 words in which the word apple appears 5 times. The term frequency (TF) for apple is then 5 / 100 = 0.05.

Is TF-IDF a word embedding?

TF-IDF is another way of producing such word representations, one that improves on raw counts by weighting words. TF-IDF is often used as an intermediate step in more advanced models.

What are the applications of TF-IDF in NLP?

TF-IDF (Term Frequency-Inverse Document Frequency) is used in NLP to assess the importance of words in a document relative to a collection of documents. It helps identify key terms by considering both their frequency and uniqueness.

What is term frequency-inverse document frequency in Python?

TF-IDF stands for “Term Frequency — Inverse Document Frequency”. This is a technique to quantify words in a set of documents. We generally compute a score for each word to signify its importance in the document and corpus. This method is a widely used technique in Information Retrieval and Text Mining.

What are TF-IDF and BM25?

TF-IDF (term frequency-inverse document frequency) and BM25 (Okapi Best Matching 25) are two methods for document search. The typical use case is when you have 1,000 documents and you want to retrieve the best matching document for the search query "dog".
