sklearn.feature_extraction.text.TfidfVectorizer (2024)

class sklearn.feature_extraction.text.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.
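
The equivalence stated above can be checked directly. The following is a minimal sketch, assuming default parameters on both sides; the corpus is illustrative and not part of the original page.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Illustrative corpus (an assumption for this sketch).
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

# One step: raw text -> TF-IDF matrix.
X_direct = TfidfVectorizer().fit_transform(corpus)

# Two steps: raw text -> token counts -> TF-IDF matrix.
counts = CountVectorizer().fit_transform(corpus)
X_two_step = TfidfTransformer().fit_transform(counts)

# Both routes should produce the same weights.
same = np.allclose(X_direct.toarray(), X_two_step.toarray())
print(same)
```

The two-step form is mainly useful when the raw counts are needed for something else as well.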

For an example of usage, see Classification of text documents using sparse features.

For an efficiency comparison of the different feature extractors, see FeatureHasher and DictVectorizer Comparison.

See also

CountVectorizer: Transforms text into a sparse matrix of n-gram counts.
TfidfTransformer: Performs the TF-IDF transformation from a provided matrix of counts.

Notes

The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.
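
A minimal sketch of the note above, assuming an illustrative corpus and a max_df setting chosen only so that some terms are dropped. The hasattr guard is an assumption for robustness in case the installed version does not expose stop_words_.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus (an assumption for this sketch).
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# max_df=0.5 drops terms that occur in more than half of the documents.
vectorizer = TfidfVectorizer(max_df=0.5)
expected = vectorizer.fit_transform(corpus).toarray()

# Drop the introspection-only attribute before pickling, if it is present.
if hasattr(vectorizer, "stop_words_"):
    vectorizer.stop_words_ = None

restored = pickle.loads(pickle.dumps(vectorizer))
actual = restored.transform(corpus).toarray()

# The restored vectorizer should transform exactly as before.
unchanged = np.allclose(expected, actual)
print(unchanged)
```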

Examples

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)
>>> print(X.shape)
(4, 9)

Methods

build_analyzer()	Return a callable to process input data.
build_preprocessor()	Return a function to preprocess the text before tokenization.
build_tokenizer()	Return a function that splits a string into a sequence of tokens.
decode(doc)	Decode the input into a string of unicode symbols.
fit(raw_documents[,y])	Learn vocabulary and idf from training set.
fit_transform(raw_documents[,y])	Learn vocabulary and idf, return document-term matrix.
get_feature_names_out([input_features])	Get output feature names for transformation.
get_metadata_routing()	Get metadata routing of this object.
get_params([deep])	Get parameters for this estimator.
get_stop_words()	Build or fetch the effective stop words list.
inverse_transform(X)	Return terms per document with nonzero entries in X.
set_fit_request(*[,raw_documents])	Request metadata passed to the `fit` method.
set_params(**params)	Set the parameters of this estimator.
set_transform_request(*[,raw_documents])	Request metadata passed to the `transform` method.
transform(raw_documents)	Transform documents to document-term matrix.

build_analyzer()

Return a callable to process input data.

The callable handles preprocessing, tokenization, and n-grams generation.

Returns:

analyzer: callable: A function to handle preprocessing, tokenization and n-grams generation.
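
As a brief illustration of build_analyzer, the sketch below applies the returned callable to a single string. The ngram_range value and the input sentence are assumptions chosen for the example; no fitting is required.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams; everything else is left at its default.
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

# The callable lowercases, tokenizes, and then builds the n-grams.
tokens = analyzer("This is the first document.")
print(tokens)
# ['this', 'is', 'the', 'first', 'document',
#  'this is', 'is the', 'the first', 'first document']
```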

build_preprocessor()

Return a function to preprocess the text before tokenization.

Returns:

preprocessor: callable: A function to preprocess the text before tokenization.
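
A small sketch of what the returned preprocessor does, assuming strip_accents='unicode' together with the default lowercase=True. The input string is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Lowercasing is on by default; accent stripping is enabled explicitly here.
preprocess = TfidfVectorizer(strip_accents="unicode").build_preprocessor()

# The preprocessor works on whole strings, before any tokenization.
cleaned = preprocess("Café Déjà VU")
print(cleaned)  # cafe deja vu
```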

build_tokenizer()

Return a function that splits a string into a sequence of tokens.

Returns:

tokenizer: callable: A function to split a string into a sequence of tokens.
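
The sketch below shows the tokenizer built from the default token_pattern, which keeps only tokens of two or more word characters. The input sentence is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tokenize = TfidfVectorizer().build_tokenizer()

# Single-character tokens and punctuation are dropped; case is preserved,
# because lowercasing belongs to the preprocessor, not the tokenizer.
tokens = tokenize("A cat, a dog & I went home.")
print(tokens)  # ['cat', 'dog', 'went', 'home']
```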

decode(doc)

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.

Parameters:

doc: bytes or str: The string to decode.

Returns:

doc: str: A string of unicode symbols.
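
To illustrate how the decoding strategy follows the vectorizer parameters, here is a minimal sketch assuming encoding='latin-1' and the default input='content'. The byte string is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(encoding="latin-1")

# Bytes are decoded with the configured encoding.
text = vectorizer.decode(b"caf\xe9")

# Input that is already a str is returned unchanged.
already_text = vectorizer.decode("plain str")

print(text, already_text)
```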

fit(raw_documents, y=None)

Learn vocabulary and idf from training set.

Parameters:

raw_documents: iterable: An iterable which generates either str, unicode or file objects.
y: None: This parameter is not needed to compute tfidf.

Returns:

self: object: Fitted vectorizer.
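
A short sketch of what fit learns, assuming default parameters (including smooth_idf=True) and an illustrative corpus. With these defaults, a term that occurs in every document gets an idf of exactly 1, and rarer terms get larger values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus (an assumption for this sketch).
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer()
fitted = vectorizer.fit(corpus)  # returns the vectorizer itself

# vocabulary_ maps each term to its column index; idf_ holds one weight per column.
vocab = vectorizer.vocabulary_
idf_the = vectorizer.idf_[vocab["the"]]  # "the" occurs in all four documents
idf_and = vectorizer.idf_[vocab["and"]]  # "and" occurs in one document

print(len(vocab), idf_the, idf_and)
```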

fit_transform(raw_documents, y=None)

Learn vocabulary and idf, return document-term matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters:

raw_documents: iterable: An iterable which generates either str, unicode or file objects.
y: None: This parameter is ignored.

Returns:

X: sparse matrix of shape (n_samples, n_features): Tf-idf-weighted document-term matrix.
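
The sketch below compares fit_transform with fit followed by transform, and checks that each row has unit length under the default norm='l2'. The corpus is illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus (an assumption for this sketch).
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Single call.
X_one = TfidfVectorizer().fit_transform(corpus)

# Separate fit and transform calls on the same data.
X_two = TfidfVectorizer().fit(corpus).transform(corpus)

same = np.allclose(X_one.toarray(), X_two.toarray())

# With norm='l2', every non-empty row is scaled to unit Euclidean length.
row_norms = np.linalg.norm(X_one.toarray(), axis=1)

print(X_one.shape, same)
```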

get_feature_names_out(input_features=None)

Get output feature names for transformation.