How to create content recommendations using TF-IDF

After work, when I’m not learning about data science, practising data science, or writing about data science, I like to browse classic car auction sites looking for cars I can’t afford to buy, don’t have enough room to house, and whose purchase would lead to divorce and bankruptcy.

One of my favourite such sites is The Market, as it includes well-written product copy that other car auction sites don't have. However, although its inventory is small, it currently lacks a recommendation engine that serves up other cars I might like to imagine I could afford to buy.

I totally get why The Market doesn’t have recommendations. The number of cars sold is very low, there are a limited number of concurrent auctions, and most people make a single purchase, so a regular “customers who bought this also bought” model would be useless.

Content-based recommendations

However, despite the lack of sales data normally required to generate product recommendations, there’s still a way that these could be added. We could generate recommendations based on content similarity instead.

For example, if you’re looking at a listing for a Ferrari 308 GTB, you might also be interested in checking out the 308 GTS. We can do this via two Natural Language Processing (NLP) techniques: Term Frequency-Inverse Document Frequency, or TF-IDF, and cosine similarity.

Term Frequency Inverse Document Frequency (TF-IDF)

TF-IDF is a statistic which shows the importance of specific words in a document relative to the other documents in a collection of documents, or “corpus”. Basically, TF-IDF counts the number of times a given term occurs within a document and compares that to how often it occurs across the other documents.

If a page contains the words “Ferrari 308” numerous times, and other documents in the corpus do not, then it’s probable that the document is about the “Ferrari 308”. Simply find all the documents where the scores for a phrase are high and you’ve got your matches.

Cosine similarity

Cosine similarity measures the similarity between two vectors. Since TF-IDF returns vectors showing the score a document gets versus the corpus, we can use cosine similarity to identify the closest matches after we’ve used TF-IDF to generate the vectors.

I’ll skip the complicated maths, but basically we first generate the TF-IDF vectors containing the raw numbers, and then use cosine similarity to check these across all documents. We can then sort the output and identify the closest matches based on their text similarity.
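The maths isn’t actually that scary: cosine similarity is the dot product of two vectors divided by the product of their magnitudes, giving 1.0 for vectors pointing the same way and 0.0 for vectors with nothing in common. A minimal sketch with NumPy on made-up vectors:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, just longer
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

print(cosine(a, b))  # 1.0 - direction matters, magnitude doesn't
print(cosine(a, c))  # 0.0 - no shared components
```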

Picture by Sid Ramirez, Unsplash.

Import the packages

To get started, open up a Jupyter notebook and import pandas, numpy, the TfidfVectorizer, cosine_similarity and linear_kernel modules from scikit-learn.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

Load the data

Next, load up your dataset. I’m using some product descriptions I scraped from The Market, but you can use product page content, blog posts, or anything else you have which is similar.

df = pd.read_csv('themarket_pages.csv')
df.sample(10)
url title description h1 html image text
732 1969 MG MGC GT AUTOMATIC For Sale by Auction ['This MGC is originally a Channel Islands car... 1969 MG MGC GT AUTOMATIC <!doctype html>\n<html class="no-js" lang="en"... [' 1969 MG MGC GT AUTOMATIC\nBackground\nOnly pro...
684 2004 Mercedes-Benz SL65 AMG For Sale by Auction ['With just 25,500 miles on the odometer, this... 2004 Mercedes-Benz SL65 AMG <!doctype html>\n<html class="no-js" lang="en"... [' 2004 Mercedes-Benz SL65 AMG\nBackground\nFollo...
530 1959 LAND ROVER SERIES II LWB For Sale by Auction ['Spending the first third of its life oversea... 1959 LAND ROVER SERIES II LWB <!doctype html>\n<html class="no-js" lang="en"... [' 1959 LAND ROVER SERIES II LWB\nBackground\nFro...
736 2000 MG MGF VVC 1.8 For Sale by Auction ['This delightful and honest little 1.8-litre ... 2000 MG MGF VVC 1.8 <!doctype html>\n<html class="no-js" lang="en"... [' 2000 MG MGF VVC 1.8\nBackground\nThe MG F and ...
854 1989 Peugeot 205 GTi 1.9 For Sale by Auction ['First registered in August 1989, the vendor ... 1989 Peugeot 205 GTi 1.9 <!doctype html>\n<html class="no-js" lang="en"... [' 1989 Peugeot 205 GTi 1.9\nBackground\nLaunched...
44 2008 Alpina BMW D3 For Sale by Auction ['One of only 614 ever produced, this lovely A... 2008 Alpina BMW D3 <!doctype html>\n<html class="no-js" lang="en"... [' 2008 Alpina BMW D3\nBackground\nFollowing the ...
797 1963 MGB Roadster For Sale by Auction ['With just one previous keeper, a Dr Chapman ... 1963 MGB Roadster <!doctype html>\n<html class="no-js" lang="en"... [' 1963 MGB Roadster\nBackground\nIntroduced in 1...
149 2010 BENTLEY Flying Spur Speed For Sale by Auc... ['First registered on the 5th of November 2010... 2010 BENTLEY Flying Spur Speed <!doctype html>\n<html class="no-js" lang="en"... [' 2010 BENTLEY Flying Spur Speed\nBackground\nEs...
691 1990 Mercedes 190E 2.0 For Sale by Auction ['This is a five-owner-from new example finish... 1990 Mercedes 190E 2.0 <!doctype html>\n<html class="no-js" lang="en"... [' 1990 Mercedes 190E 2.0\nBackground\nThe W201 1...
1201 1995 MERCEDES-BENZ SL60 AMG For Sale by Auction ['1995 MERCEDES-BENZ SL60 AMG 43k Miles - Imma... 1995 MERCEDES-BENZ SL60 AMG <!doctype html>\n<html class="no-js" lang="en"... [' 1995 MERCEDES-BENZ SL60 AMG\nBackground\nMuch ...

Prepare the data

Next, we’ll tidy up the data a little. There are some duplicate page titles in here, so we’ll drop the duplicates and create a series mapping each title to its row index, so we can use it for looking up values later. We’ll also fill any NaN values with empty strings to avoid TF-IDF complaining.

indices = pd.Series(df.index, index=df['title']).drop_duplicates()
content = df['text'].fillna('')

Create TF-IDF model

First, we’ll set up TfidfVectorizer and tell it to use English stop words. This will remove common words like “the” and “of” to leave the more important ones. TF-IDF will additionally down-weight common words that appear across documents.

tfidf = TfidfVectorizer(stop_words='english')

Next, we’ll create a TF-IDF matrix by passing the text column to the fit_transform() function. That will give us the numbers from which we can calculate similarities.

tfidf_matrix = tfidf.fit_transform(content)
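It’s worth a quick sanity check before moving on. The sketch below fits the vectorizer on a made-up two-document corpus (stand-ins for the real descriptions): you get one row per document, one column per unique term, and each row is L2-normalised to unit length by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents standing in for the scraped descriptions
content = [
    "1982 Ferrari 308 GTSi for sale",
    "1959 Land Rover Series II for sale",
]
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(content)

# One row per document, one column per unique term in the vocabulary
print(tfidf_matrix.shape)

# Rows are L2-normalised by default, so each document vector has unit length
print(np.linalg.norm(tfidf_matrix[0].toarray()))  # 1.0
```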

Now we have our matrix of TF-IDF vectors, we can use linear_kernel() to calculate a cosine similarity matrix for the vectors. There are several ways to do this, but the below approach worked for me.

cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
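The reason linear_kernel() works here is that TfidfVectorizer L2-normalises each row by default, so the plain dot product it computes is identical to cosine similarity, just faster. A quick check on made-up documents confirms the two give the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up mini corpus
docs = ["ferrari 308 gtb", "ferrari 308 gts", "land rover series ii"]
matrix = TfidfVectorizer().fit_transform(docs)

# linear_kernel is just the dot product; cosine_similarity normalises first,
# but TF-IDF rows are already unit length, so the results match
a = linear_kernel(matrix, matrix)
b = cosine_similarity(matrix, matrix)
print(np.allclose(a, b))  # True
```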

Get recommendations based on text similarity

Now the model is built, and we have our TF-IDF matrix and a cosine similarity matrix covering all the documents, we can create a helper function to generate content recommendations. The code in this is a bit fiddly, so I’ve annotated it at each step.

Basically, it takes the dataframe of text, the name of the column being used to search from, the value to search for, the cosine similarity matrix, and the number of recommendations to return. It then looks up the title and returns the documents with the closest cosine similarity.

def get_recommendations(df, column, value, cosine_similarities, limit=10):
    """Return a dataframe of content recommendations based on TF-IDF cosine similarity.

    Args:
        df (object): Pandas dataframe containing the text data.
        column (string): Name of column used, i.e. 'title'.
        value (string): Name of title to get recommendations for, i.e.
            '1982 Ferrari 308 GTSi For Sale by Auction'.
        cosine_similarities (array): Cosine similarities matrix from linear_kernel().
        limit (int, optional): Optional limit on number of recommendations to return.

    Returns:
        Pandas dataframe.
    """

    # Return indices for the target dataframe column and drop any duplicates
    indices = pd.Series(df.index, index=df[column]).drop_duplicates()

    # Get the index for the target value
    target_index = indices[value]

    # Get the cosine similarity scores for the target value
    cosine_similarity_scores = list(enumerate(cosine_similarities[target_index]))

    # Sort the cosine similarities in order of closest similarity
    cosine_similarity_scores = sorted(cosine_similarity_scores, key=lambda x: x[1], reverse=True)

    # Keep the requested number of closest scores, excluding the target item itself
    cosine_similarity_scores = cosine_similarity_scores[1:limit + 1]

    # Extract the indices and scores from the tuples
    index = [x[0] for x in cosine_similarity_scores]
    scores = [x[1] for x in cosine_similarity_scores]

    # Get the actual recommendations
    recommendations = df[column].iloc[index]

    # Return a dataframe
    df = pd.DataFrame(list(zip(index, recommendations, scores)),
                      columns=['index', 'recommendation', 'cosine_similarity_score'])
    return df

Generate the recommendations

Finally, we can put it in action and see how it works. First, we’ll take the title of the “1982 Ferrari 308 GTSi For Sale by Auction” auction and see what we get back. It works perfectly. The closest matches are the 308 GTB, the 308 GTS, and another 308 GTB, followed by more Ferraris.

recommendations = get_recommendations(df, 'title', '1982 Ferrari 308 GTSi For Sale by Auction', cosine_similarities)
index recommendation cosine_similarity_score
0 284 1976 FERRARI 308GTB VETRORESINA For Sale by Au... 0.554754
1 282 1985 FERRARI 308 GTS QV For Sale by Auction 0.424918
2 285 1977 Ferrari 308GTB For Sale by Auction 0.384198
3 296 1999 FERRARI F355 F1 GTS For Sale by Auction 0.335060
4 295 1996 FERRARI F355 GTS - Manual For Sale by Auc... 0.309254
5 293 2006 FERRARI 612 SCAGLIETTI For Sale by Auction 0.302505
6 288 1992 FERRARI 348tb For Sale by Auction 0.302221
7 297 1998 FERRARI F355 Spider For Sale by Auction 0.300773
8 281 1973 Ferrari 246GT Dino For Sale by Auction 0.298583
9 294 1999 FERRARI F355 F1 Berlinetta For Sale by Au... 0.294583

The “1959 LAND ROVER SERIES II LWB For Sale by Auction” search was a bit tougher, but all the Series II Land Rovers do appear at the top, along with a Range Rover, which seems fair enough. The approach seems to work really well on this content.

recommendations = get_recommendations(df, 'title', '1959 LAND ROVER SERIES II LWB For Sale by Auction', cosine_similarities)
index recommendation cosine_similarity_score
0 527 1968 LAND ROVER SERIES II A Pick up For Sale b... 0.434031
1 521 1955 LAND ROVER Series 1 Soft Top. 86 Inch For... 0.425604
2 528 1958 Land Rover SERIES II SWB For Sale by Auction 0.415383
3 535 1967 LAND ROVER SERIES IIa 88inch For Sale by ... 0.408842
4 523 1968 LAND ROVER Series 2A For Sale by Auction 0.401876
5 529 1963 LAND ROVER SERIES II 88" For Sale by Auction 0.398268
6 525 1979 LAND ROVER Series 3 88 For Sale by Auction 0.392146
7 957 1999 RANGE ROVER P38 TReK Expedition For Sale ... 0.390698
8 499 1970 Land Rover 1/2 ton Lightweight V8 Series ... 0.389819
9 539 1969 Land Rover SWB For Sale by Auction 0.384898
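If you wanted to serve these on a live site, you’d likely precompute the top matches for every listing in one pass rather than calling the helper per request. A hedged sketch on a made-up three-row dataframe (column names mirror the real one): np.argsort over the similarity matrix gives every listing’s nearest neighbours at once, skipping the first column because every item is most similar to itself.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up stand-in for the scraped dataset
df = pd.DataFrame({
    'title': ['Ferrari 308 GTB', 'Ferrari 308 GTS', 'Land Rover Series II'],
    'text': ['ferrari 308 gtb berlinetta', 'ferrari 308 gts targa',
             'land rover series ii 88'],
})

matrix = TfidfVectorizer().fit_transform(df['text'])
sims = linear_kernel(matrix, matrix)

# For each row, sort neighbours by similarity (descending) and drop
# column 0, which is always the item itself
top = np.argsort(-sims, axis=1)[:, 1:3]
df['recommendations'] = [df['title'].iloc[row].tolist() for row in top]
print(df[['title', 'recommendations']])
```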

Matt Clarke, Saturday, August 14, 2021
