How to apply TF-IDF to a test set

Asked: 2019-08-20 11:02:14

Tags: python scikit-learn tf-idf

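Suppose I have two text files. File 1 contains the training set, which is mainly used to define the vocabulary. File 2 contains the words entered by the user.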

d1 = (
"Project 1 details on Machine learning",
"Project 2 detail on machine learning and statics",
"Project 3 is on mach learn as well"
)

# The trailing comma makes d2 a one-element tuple rather than a bare string
d2 = (
"Projects related to machine learning",
)

Now, using sklearn, we compute the TF-IDF of d1:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(d1)
print(tfidf_matrix.shape)

Now, for the query d2, I want to compute its TF-IDF vector based on what was learned from d1. How can I do that?

2 answers:

Answer 0: (score: 0)

As with any transformer in scikit-learn, call .fit on the training set (here, .fit_transform(d1)); you can then call .transform on the new data.
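A minimal sketch of that fit-then-transform pattern, assuming d1 and d2 are both given as lists of documents (the variable names train_matrix and query_matrix are only illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

d1 = [
    "Project 1 details on Machine learning",
    "Project 2 detail on machine learning and statics",
    "Project 3 is on mach learn as well",
]
# The query is wrapped in a list: transform expects an iterable of documents.
d2 = ["Projects related to machine learning"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training documents only.
train_matrix = vectorizer.fit_transform(d1)

# Reuse the fitted vocabulary and IDF weights for the query.
# Words that never appeared in d1 are simply ignored.
query_matrix = vectorizer.transform(d2)

print(train_matrix.shape)
print(query_matrix.shape)
```

Both matrices have the same number of columns, so the query vector can be compared directly against the training rows, for example with cosine similarity.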

Answer 1: (score: 0)

You can pass the vocabulary_ attribute of the first vectorizer as an argument to the second vectorizer:

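from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on the training documents to learn the vocabulary
vectorizer1 = TfidfVectorizer()
vectorizer1.fit_transform(d1)

# Build a second vectorizer restricted to the vocabulary learned from d1
vectorizer2 = TfidfVectorizer(vocabulary=vectorizer1.vocabulary_)

# d2 must be an iterable of documents (e.g. a list or tuple), not a bare string.
# Note: this reuses the vocabulary only; the IDF weights are re-estimated from d2.
vectorizer2.fit_transform(d2)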
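One thing worth checking with this approach is which parts of the model are actually carried over. The sketch below, assuming d1 and d2 are lists of documents (reused, refit, same_columns and same_idf are illustrative names), compares transforming with the first vectorizer against refitting a second one on the shared vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

d1 = [
    "Project 1 details on Machine learning",
    "Project 2 detail on machine learning and statics",
    "Project 3 is on mach learn as well",
]
d2 = ["Projects related to machine learning"]

vectorizer1 = TfidfVectorizer()
vectorizer1.fit(d1)

# Option A: reuse both the vocabulary and the IDF weights learned from d1.
reused = vectorizer1.transform(d2)

# Option B: reuse only the vocabulary; fit_transform re-estimates IDF from d2.
vectorizer2 = TfidfVectorizer(vocabulary=vectorizer1.vocabulary_)
refit = vectorizer2.fit_transform(d2)

# The column layout is shared, but the IDF weights are not.
same_columns = vectorizer2.vocabulary_ == vectorizer1.vocabulary_
same_idf = np.allclose(vectorizer1.idf_, vectorizer2.idf_)
print(same_columns, same_idf)
```

If the goal is to weight the query by what was learned from d1, option A is likely the closer match; option B only guarantees that the columns line up, and its weights can differ in general.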