TF-IDF for Document Ranking from Scratch in Python on a Real-World Dataset

  • What is TF-IDF?
  • Preprocessing data.
  • Weights to title and body.
  • Document retrieval using TF-IDF matching score.
  • Document retrieval using TF-IDF cosine similarity.

TF-IDF stands for “Term Frequency — Inverse Document Frequency”. This is a technique to quantify words in a set of documents. We generally compute a score for each word to signify its importance in the document and corpus. This method is a widely used technique in Information Retrieval and Text Mining.

Take a sentence such as “This building is so tall”. It is easy for us to understand because we know the semantics of the words and the sentence. But how can a program (e.g. Python) interpret this sentence? A program handles textual data much more easily when it is represented as numerical values. So, for this reason, we need to vectorize all of the text so that it is better represented.

By vectorizing the documents we can perform further tasks such as finding relevant documents, ranking, clustering, and so on. This exact technique is used when you perform a Google search (although Google has since moved on to newer, transformer-based techniques). The web pages are called documents and the text you search with is called a query. The search engine maintains a fixed representation of all the documents. When you search with a query, the search engine finds the relevance of the query to all of the documents, ranks them in order of relevance, and shows you the top k documents. All of this is done using the vectorized form of the query and the documents.

Now coming back to our TF-IDF,

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

Terminology

  • t — term (word)
  • d — document (set of words)
  • N — number of documents in the corpus
  • corpus — the total document set

Term Frequency

This measures the frequency of a word in a document. It depends heavily on the length of the document and the generality of the word; for example, a very common word such as “was” can appear many times in a document. If we take two documents, one with 100 words and one with 10,000 words, there is a high probability that the common word “was” appears more often in the 10,000-word document. But we cannot say that the longer document is more important than the shorter one. For this exact reason, we normalize the frequency value: we divide the frequency by the total number of words in the document.

Recall that we ultimately need to vectorize the documents. When we do so, we cannot consider only the words present in a particular document; if we did, the vector lengths would differ between documents and it would not be feasible to compute similarity. So instead we vectorize the documents over the vocab. The vocab is the list of all possible words in the corpus.

We need the word counts of all the vocab words and the length of the document to compute TF. If a term doesn't exist in a particular document, its TF value is 0 for that document. In the extreme case where every word in the document is the same, TF is 1. The normalised TF value therefore lies in the range [0, 1], inclusive.

TF is individual to each document and word, hence we can formulate TF as follows:

tf(t,d) = count of t in d / number of words in d

If we already computed the TF value and if this produces a vectorized form of the document, why not use just TF to find the relevance between documents? Why do we need IDF?

Let me explain. The most common words, such as ‘is’ and ‘are’, will have very high values, giving those words very high importance. But using such words to compute relevance produces bad results. These kinds of common words are called stop words. Although we will remove stop words later in the preprocessing step, it is better to also measure how widely a word appears across the documents and reduce its weight accordingly.

Document Frequency

This measures how common a term is across the whole corpus. It is very similar to TF, but while TF counts the occurrences of a term t within a single document d, DF counts the number of documents in the set of N documents that contain t. We count a document once if the term is present in it at least once; we do not need to know how many times the term appears.

df(t) = occurrence of t in N documents

To keep this in a bounded range as well, we normalize by dividing by the total number of documents. Our main goal is to measure the informativeness of a term, and DF is essentially the inverse of that, which is why we invert the DF.

Inverse Document Frequency

IDF is the inverse of the document frequency which measures the informativeness of term t. When we calculate IDF, it will be very low for the most occurring words such as stop words (because they are present in almost all of the documents, and N/df will give a very low value to that word). This finally gives what we want, a relative weightage.

idf(t) = N/df

Now there are a few other problems with IDF. When the corpus is large, say N = 10,000, the IDF value explodes. So, to dampen the effect, we take the log of IDF.

At query time, a word that is not in the vocab is simply ignored. But in some cases we use a fixed vocab, and a few vocab words may be absent from every document; in such cases the df will be 0. As we cannot divide by 0, we smooth the value by adding 1 to the denominator.

idf(t) = log(N/(df + 1))

Finally, by taking a multiplicative value of TF and IDF, we get the TF-IDF score. There are many different variations of TF-IDF but for now, let us concentrate on this basic version.

tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
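To make these formulas concrete, here is a minimal sketch in Python using a toy corpus (the three example documents and the helper names are purely illustrative and not part of the dataset used below):

import math

# toy corpus: each document is already tokenized
corpus = [["this", "building", "is", "so", "tall"],
          ["this", "paper", "is", "a", "survey", "paper"],
          ["the", "building", "is", "old"]]
N = len(corpus)

def tf(t, d):
    # count of t in d / number of words in d
    return d.count(t) / len(d)

def df(t):
    # number of documents in which t appears
    return sum(1 for d in corpus if t in d)

def tf_idf_score(t, d):
    # tf-idf(t, d) = tf(t, d) * log(N / (df + 1))
    return tf(t, d) * math.log(N / (df(t) + 1))

print(tf_idf_score("paper", corpus[1]))  # (2/6) * log(3/2), roughly 0.135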


Now that we have learnt what TF-IDF is, let us compute similarity scores on a dataset.

The dataset we are going to use is an archive of short stories; it contains lots of documents in different formats. Download the dataset and open your notebooks (Jupyter Notebooks, I mean 😜).

Dataset Link: http://archives.textfiles.com/stories.zip

Step 1: Analysing the Dataset

The first step in any machine learning task is to analyse the data. If we look at the dataset, at first glance we see documents written in English. Each document has a different name, and there are two sub-folders inside the archive.

One of the important tasks is to identify the title within the body. If we analyse the documents, the titles are aligned in different ways, though most of them are centre aligned. We need to figure out a way to extract the title. But before we get all pumped up and start coding, let us analyse the dataset a little deeper.

Take a few minutes to analyse the dataset yourself. Try to explore…

On closer inspection, we notice that there is an index.html in each folder (including the root) which contains all the document names and their titles. So, let us consider ourselves lucky: the titles are given to us and we do not have to extract them from each document.

Step 2: Extracting Title & Body

There is no single way to do this; it totally depends on the problem statement at hand and on the analysis we do on the dataset.

As we have already found that the titles and the document names are in the index.html, we need to extract those names and titles. We are lucky that index.html has tags that we can use as patterns to extract our required content.


Before we start extracting the titles and file names, and since we have multiple folders, let's first crawl the folders so that we can later read all the index.html files at once.

import os
folders = [x[0] for x in os.walk(str(os.getcwd()) + '/stories/')]

os.walk walks the directory tree and yields, for each folder, its path along with its sub-directories and files; we keep only x[0], the folder path, and store the result in folders. os.getcwd gives us the current working directory, and we search in the current directory + the stories folder, as our data files are in the stories folder.

Always assume that you are dealing with a huge dataset; this mindset pushes you to automate the code instead of hard-coding paths.

Looking at the folders list, the root folder entry has an extra trailing /, so we are going to remove it.

folders[0] = folders[0][:len(folders[0])-1]

The above code removes the last character of the 0th entry in folders, which is the root folder.

Now, let's crawl through all the index.html files to extract the titles. To do that we need to find a pattern that picks out the title. As this is HTML, our job will be a little simpler.

let’s see…

(Snippet of an index.html file showing how the file names and titles are marked up.)

We can clearly observe that each file name is enclosed between ><A HREF=" and ">, and each title lies between <BR><TD> and the newline (\n).

We will use simple regular expressions to retrieve the names and titles. The following code returns the list of all values that match each pattern, so the names and titles variables hold the lists of all file names and titles.

import re

names = re.findall('><A HREF="(.*)">', text)
titles = re.findall('<BR><TD> (.*)\n', text)

Now that we have code to retrieve the values from an index file, we just need to iterate over all the folders and get the titles and file names from all the index.html files:

- read the file from index files

- extract title and names

- iterate to next folder

dataset = []

for i in folders:
    # read each folder's index.html
    file = open(i + "/index.html", 'r')
    text = file.read().strip()
    file.close()

    # extract the file names and titles listed in it
    file_name = re.findall('><A HREF="(.*)">', text)
    file_title = re.findall('<BR><TD> (.*)\n', text)

    for j in range(len(file_name)):
        dataset.append((str(i) + str(file_name[j]), file_title[j]))

This builds the index of the dataset as a list of tuples of (file location, title). There is a small issue: the root folder's index.html also lists the sub-folders and their links, and we need to remove those entries.


We simply use a conditional check with a flag to drop those entries the first time through (i.e. for the root folder).

if c == False:
    # the root index.html lists the sub-folders first; drop those entries
    file_name = file_name[2:]
    c = True

Preprocessing is one of the major steps when we are dealing with any kind of text model. At this stage, we have to look at the distribution of our data, decide which techniques are needed, and how deeply we should clean.

This step has no hard and fast rule; it totally depends on the problem statement. A few near-mandatory preprocessing steps are: converting to lowercase, removing punctuation, removing stop words, and lemmatization/stemming. For our problem statement, the basic preprocessing steps should be sufficient.

Lowercase

During text processing, each sentence is split into words, and each word is treated as a token after preprocessing. Programming languages treat text as case sensitive, which means that The is different from the. We humans know that both belong to the same token, but because of the character encoding they are treated as different tokens. Converting to lowercase is therefore a mandatory preprocessing step. As we have all our data in a list, numpy has a method that can convert the list of lists to lowercase at once.

np.char.lower(data)

Stop words

Stop words are the most commonly occurring words, and they add no value to the document vector. In fact, removing them improves computation and space efficiency. The nltk library provides a downloadable list of stopwords, so instead of listing them ourselves we can use nltk, iterate over all the words, and drop the stop words. There are more efficient ways to do this, but I'll just give a simple method.


We are going to iterate over all the words and append a word to the new text only if it is not a stop word.

new_text = ""
for word in words:
if word not in stop_words:
new_text = new_text + " " + word

Punctuation

Punctuation is the set of symbols in our corpus documents that carry no value for our task. We should be a little careful here: there can be problems such as "U.S." (United States) being converted to "us" after preprocessing. Hyphens and apostrophes usually need to be handled with some care. But for this problem statement, we are simply going to remove these symbols.

symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
for i in symbols:
data = np.char.replace(data, i, ' ')

We store all our symbols in a string and iterate over it, replacing each symbol with a space across the whole dataset. We use numpy here because our data is stored in a list of lists, and numpy handles element-wise string operations well.

Apostrophe

Note that the apostrophe is not among the punctuation symbols. If we removed punctuation first, don't would become dont, which no longer matches the stop word list and would never be removed. What we do instead is remove the stop words first, then the symbols, and finally repeat stop-word removal, as a few non-stop words may still carry an apostrophe.

return np.char.replace(data, "'", "")

Single Characters

Single characters are not very useful for judging the importance of a document, and after the earlier steps a few leftover single characters may just be stray symbols, so it is good to remove them.

new_text = ""
for w in words:
if len(w) > 1:
new_text = new_text + " " + w

We just iterate over all the words and skip any word whose length is not greater than 1.

Stemming

This is the final and most important part of the preprocessing. Stemming reduces words to their stem.

For example, playing and played are variants of the same word, both indicating the action play. A stemmer does exactly this: it reduces a word to its stem. We are going to use the Porter stemmer, a rule-based stemmer that identifies and removes the suffix or affix of a word. The words produced by the stemmer are not always meaningful, but they will be identified as a single token by the model.

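As a sketch, this step could look something like the following, using nltk's PorterStemmer (the stemming helper name matches the one called in the preprocess function below, but the exact implementation here is assumed):

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def stemming(data):
    stemmer = PorterStemmer()
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + stemmer.stem(w)
    return new_text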

Lemmatisation

Lemmatisation reduces a word to its root form (its lemma). Unlike stemming, lemmatisation makes sure that the reduced word is again a dictionary word (a word present in the language). WordNetLemmatizer from nltk can be used to lemmatize any word.

Stemming vs Lemmatization

stemming — need not produce a dictionary word; removes suffixes and affixes based on a few rules

lemmatization — always produces a dictionary word; reduces the word to its root form

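For instance, a quick comparison using nltk (assuming the wordnet data has been downloaded) shows the difference:

from nltk.stem import PorterStemmer, WordNetLemmatizer

print(PorterStemmer().stem("studies"))           # studi  (not a dictionary word)
print(WordNetLemmatizer().lemmatize("studies"))  # study  (a dictionary word)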

A more thorough approach is to first lemmatise and then stem, but stemming alone is also fine for some problem statements; here we will not lemmatise.

Converting Numbers

A user may give a query such as 100 dollars or hundred dollars. For the user, both search terms mean the same thing, but our IR model treats them separately, since 100, dollars and hundred are stored as different tokens. To make our IR model a little better, we need to convert 100 to hundred. To achieve this we use a library called num2words.
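A rough sketch of such a conversion step (the convert_numbers helper name matches the one called in the preprocess function below; the exact implementation here is assumed):

from num2words import num2words
from nltk.tokenize import word_tokenize

def convert_numbers(data):
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        if w.isdigit():
            w = num2words(int(w))  # e.g. 102 -> "one hundred and two"
        new_text = new_text + " " + w
    return new_text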


If we look a little closer at the output above, the conversion introduces new symbols (such as hyphens) and phrases such as “one hundred and two”. But we just cleaned our data, so how do we handle this? No worries: we simply run punctuation and stop-word removal again after converting numbers to words.

Preprocessing

Finally, we put all the preprocessing methods above into a single method, which we call preprocess.

def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data)
    data = remove_apostrophe(data)
    data = remove_single_characters(data)
    data = convert_numbers(data)
    data = remove_stop_words(data)
    data = stemming(data)
    data = remove_punctuation(data)
    data = convert_numbers(data)
    return data

If you look closely, a few of the preprocessing methods are repeated. As discussed, this just helps clean the data a little more deeply. Now we need to read the documents and store the title and the body separately, as we are going to use them later. In our problem statement we have very different types of documents, which can cause errors when reading them due to encoding issues; to resolve this, just pass encoding="utf8", errors='ignore' to the open() method.
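A minimal sketch of this reading step, assuming the dataset list built earlier and the preprocess method above (processed_text and processed_title are the tokenized bodies and titles used by the later snippets):

from nltk.tokenize import word_tokenize

processed_text = []
processed_title = []

for doc, title in dataset:
    with open(doc, 'r', encoding="utf8", errors='ignore') as f:
        text = f.read().strip()
    processed_text.append(word_tokenize(str(preprocess(text))))
    processed_title.append(word_tokenize(str(preprocess(title))))

N = len(processed_text)  # number of documents in the corpus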

Step 3: Calculating TF-IDF

Recall that we need to give different weights to the title and the body. How are we going to handle that? How will the calculation of TF-IDF work in this case?

Giving different weights to title and body is a very common approach. We consider the document as body + title and use this to build the vocab. Then we give one weight to words in the title and another to words in the body. To explain this better, let us consider an example.

title = “This is a novel paper”

body = “This paper consists of survey of many papers”

Now, we need to calculate the TF-IDF for body and for the title. For the time being let us consider only the word paper, and forget about removing stop words.

What is the TF of the word paper in the title? 1/5?

No, it's 3/13. How? The word paper appears 3 times across the title and body (counting papers, which stems to paper), and the total number of words in the title and body is 13. As mentioned before, we only give the words in the title a different weight; we still consider the whole document when calculating TF-IDF.

Then the TF of paper in both title and body is the same? Yes, it’s the same! it’s just the difference in weights that we are going to give. If the word is present in both title and body, then there wouldn't be any reduction in the TF-IDF value. If the word is present only in the title, then the weight of the body for that particular word will not add to the TF of that word, and vice versa.

document = body + title

TF-IDF(document) = TF-IDF(body) * alpha + TF-IDF(title) * (1 - alpha)

Calculating DF

Let us be smart and calculate DF beforehand. We need to iterate through all the words in all the documents and store the document ids for each word. For this we use a dictionary, with the word as the key and a set of documents as the value. I say set because, even if we try to add the same document multiple times, a set will not keep duplicates.

DF = {}

for i in range(len(processed_text)):
    tokens = processed_text[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except KeyError:
            DF[w] = {i}

We add the document id to the word's set, creating the set if the word does not have one yet; this is handled by the try/except block. Here processed_text holds the document bodies, and we repeat the same for the titles as well, since we need the DF of the whole document.

len(DF) gives the number of unique words.

DF now has each word as the key and the set of doc ids as the value. But for DF we don't actually need the set of documents, only its size, so we replace each set with its count.
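Something along these lines works, together with a small doc_freq helper of the kind the later snippets call (the helper's implementation is an assumption):

for w in DF:
    DF[w] = len(DF[w])

def doc_freq(word):
    # assumed convenience helper: DF count of a word, 0 if unseen
    return DF.get(word, 0)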


There we have it, the count we need for all the words. To find the total unique words in our vocabulary, we need to take all the keys of DF.
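For example (total_vocab and total_vocab_size are the names the vectorization code further below assumes):

total_vocab = [w for w in DF]
total_vocab_size = len(DF)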


Calculating TF-IDF

Recall that we need to maintain different weights for title and body. To calculate the TF-IDF of the body or the title, we need to consider both. To make our job a little easier, let's use a dictionary with a (document, token) pair as the key and the TF-IDF score as the value. We just iterate over all the documents, use a Counter to get the frequency of the tokens, calculate tf and idf, and finally store the score under the (doc, token) pair in tf_idf. The tf_idf dictionary is for the body; we use the same logic to build a dictionary tf_idf_title for the words in the title.

from collections import Counter

tf_idf = {}

for i in range(N):
    tokens = processed_text[i]
    # frequencies over the whole document (body + title)
    counter = Counter(tokens + processed_title[i])
    words_count = len(tokens + processed_title[i])
    for token in np.unique(tokens):
        tf = counter[token] / words_count
        df = doc_freq(token)
        idf = np.log(N / (df + 1))
        tf_idf[i, token] = tf * idf

Coming to the calculation of the different weights: first, we maintain a value alpha, the weight for the body, so 1 - alpha is the weight for the title. Now a little math. We discussed that the TF-IDF value of a word is the same for body and title if the word is present in both places. We maintain two tf-idf dictionaries, one for the body and one for the title.

What we are going to do is a little smart: we calculate TF-IDF for the body; multiply all the body TF-IDF values by alpha; iterate over the tokens in the title; and replace the body TF-IDF value with the title TF-IDF value wherever the (document, token) pair exists. Take some time to process this :P

Flow:

- Calculate TF-IDF for Body for all docs

- Calculate TF-IDF for title for all docs

- multiply the Body TF-IDF with alpha

- Iterate Title TF-IDF for every (doc, token)

— if token is in body, replace the Body(doc, token) value with the value in Title(doc, token)

I know this is not easy to understand at first, but let me explain why the above flow works: we know that the tf-idf for body and title is the same if the token is in both places, and the weights we use for body and title sum to one.

TF-IDF = body_tf-idf * body_weight + title_tf-idf*title_weight

body_weight + title_weight = 1

When a token is in both places, the final TF-IDF is the same as taking either the body or the title tf_idf. That is exactly what we are doing in the above flow. So, finally, we have a dictionary tf_idf with (doc, token) pairs as keys and the weighted TF-IDF scores as values.
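A minimal sketch of that merge, assuming the tf_idf (body) and tf_idf_title dictionaries built as described, with an illustrative choice of alpha:

alpha = 0.3  # assumed body weight; 1 - alpha goes to the title

for key in tf_idf:
    tf_idf[key] *= alpha

for key in tf_idf_title:
    if key in tf_idf:
        # token also occurs in the body: replace with the title value,
        # as described in the flow above
        tf_idf[key] = tf_idf_title[key]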

Matching Score

Matching score is the simplest way to calculate similarity: for every document, we add up the tf_idf values of the tokens that appear in the query. For example, for the query “hello world”, we check every document for these words, and whenever a word exists its tf_idf value is added to the matching score of that doc_id. In the end, we sort and take the top k documents.

That is the theoretical concept. Since we are using a dictionary to hold our scores, what we actually do is iterate over all the keys in the dictionary and check whether the token part of the key is present in the query. As our keys are (document, token) pairs, whenever we find a token that is in the query we add its tf-idf value to that document's entry in another dictionary. Finally, we take the top k documents.

def matching_score(query, k=10):
    tokens = word_tokenize(str(preprocess(query)))
    query_weights = {}
    for key in tf_idf:
        if key[1] in tokens:
            query_weights[key[0]] = query_weights.get(key[0], 0) + tf_idf[key]
    # sort by score and return the top k (doc id, score) pairs
    return sorted(query_weights.items(), key=lambda x: x[1], reverse=True)[:k]

Here key[0] is the document id and key[1] is the token.

Cosine Similarity

When we have a perfectly working matching score, why do we need cosine similarity? Although matching score returns relevant documents, it fails on long queries and cannot rank them properly. Cosine similarity instead treats the documents and the query as vectors of tf-idf values and measures their similarity in cosine space (the angle between the vectors). Sometimes the query is short but still closely related to a document; in such cases cosine similarity is the better way to find relevance.

(Figure: documents plotted as blue vectors and the query as a red vector; the green line marks the distance between the query and document d1.)

Observe the plot above: the blue vectors are the documents and the red vector is the query. Even though the distance (green line) between the query and document d1 is large, the query still points in almost the same direction as d1. In such cases cosine similarity is better, as it considers the angle between the two vectors, whereas matching score would return document d3, which is not closely related.

Matching score behaves like a distance/magnitude measure (think of the straight line between the vector tips).
Cosine score considers the angle between the vectors.

Vectorization

The simplest way to compute this is to convert everything to vectors and then compute the cosine similarity. So, let's convert the query and the documents to vectors. We use the total_vocab variable, which holds the list of unique tokens, to assign an index to each token, and a numpy array of shape (no. of docs, total_vocab_size) to store the document vectors.

# Document vectorization
D = np.zeros((N, total_vocab_size))
for i in tf_idf:
    ind = total_vocab.index(i[1])
    D[i[0]][ind] = tf_idf[i]

For the query vector we also need TF-IDF values: TF we can calculate from the query itself, and for document frequency we reuse the DF dictionary we built earlier. Finally, we store the tf-idf values in a numpy array of length vocab_size, where each token's index comes from the total_vocab list.

import math

tokens = word_tokenize(str(preprocess(query)))

Q = np.zeros((len(total_vocab)))
counter = Counter(tokens)
words_count = len(tokens)

for token in np.unique(tokens):
    tf = counter[token] / words_count
    df = doc_freq(token)
    idf = math.log((N + 1) / (df + 1))
    try:
        Q[total_vocab.index(token)] = tf * idf
    except ValueError:
        pass  # query word not in the vocab, ignore it

Now, all we have to do is calculate the cosine similarity of the query against all the documents and return the top k documents. Cosine similarity is defined as follows.

np.dot(a, b)/(norm(a)*norm(b))
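Putting it together, here is a minimal sketch of the cosine-similarity ranking, assuming the D matrix and the query vector Q built above (the function names are illustrative):

from numpy.linalg import norm

def cosine_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

def cosine_ranking(Q, k=10):
    # cosine score of the query against every document vector
    scores = [cosine_sim(Q, d) for d in D]
    # indices of the k highest-scoring documents, best first
    return np.array(scores).argsort()[-k:][::-1]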

I took the text of doc_id 200 (for me) and used a short excerpt and a long excerpt from it as queries, running both through matching score and cosine similarity.

Short Query

(Screenshots: matching score and cosine similarity results for the short query.)

Long Query

(Screenshots: matching score and cosine similarity results for the long query.)

As we can see above, document 200 is consistently ranked higher by the cosine method than by the matching method; this is because cosine similarity compares the direction of the whole vectors rather than just summing the weights of matched terms.

About Me:

I’m a Senior Data Scientist and AI researcher in the field of NLP and DL.
Would love to connect: Twitter, LinkedIn.

Try it out yourself. Click here for the git repo.

Libraries Used

  • nltk, numpy, re, math, num2words