I have read many blog posts but was not satisfied with the answers. Suppose I train a tf-idf model on a few example documents:
" John like horror movie."
" Ryan watches dramatic movies"
------------so on ----------
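I use this code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# twenty_train is assumed to be loaded already (e.g. with fetch_20newsgroups)
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_counts.todense())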
# Gives the count of each word in each document
But it does not tell me which column belongs to which word. How can I get the words as column headers in the X_train_counts output, and likewise in X_train_tfidf?
So the X_train_tfidf output would be a matrix of tf-idf scores:
Horror watch movie drama
doc1 score1 -- -----------
doc2 ------------------------
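Is this correct?
What do fit and transform do?
The sklearn documentation says:
We first use the fit(..) method to fit our estimator to the data, and then the transform(..) method to transform our count matrix to a tf-idf representation.
What does "fit our estimator to the data" mean?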
Now suppose a new test document comes in:
" Ron likes thriller movies"
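How do I convert this document to tf-idf? We cannot convert it to tf-idf directly, can we?
How should the word thriller, which does not appear in the training documents, be handled?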
Answer 0 (score: 1)
Take two texts as input:
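import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
text = ["John like horror movie","Ryan watches dramatic movies"]
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
# fit_transform learns the vocabulary, then returns the count matrix
X_train_counts = count_vect.fit_transform(text)
# fit_transform learns the idf weights, then returns the tf-idf matrix
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# The learned vocabulary gives the column headers
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
pd.DataFrame(X_train_tfidf.toarray(), columns=count_vect.get_feature_names_out())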
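Output:
   dramatic    horror      john      like     movie    movies      ryan   watches
0  0.000000  0.471078  0.471078  0.471078  0.471078  0.335176  0.000000  0.000000
1  0.363788  0.000000  0.000000  0.000000  0.000000  0.776515  0.363788  0.363788
Note that these numbers are consistent with an input in which "movies" appears in both documents. With the two sentences exactly as written above, each document has four distinct words and no word is shared, so every nonzero entry would be 0.5.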
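To make the fit / transform distinction from the question more concrete, here is a minimal sketch on the same two sentences. The names counts and columns are mine, not part of sklearn. The idea is that fit only stores something learned from the training data (the vocabulary for CountVectorizer, one idf weight per word for TfidfTransformer), and transform applies what was stored.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(text)   # fit: learn the vocabulary; transform: count words

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(counts)             # fit: learn one idf weight per vocabulary column

# vocabulary_ maps each word to its column index; sort the words by that index
columns = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
print(columns)

# idf_ holds the learned weight for each column, in the same order
print({word: round(float(w), 3) for word, w in zip(columns, tfidf_transformer.idf_)})
```

Here every word occurs in exactly one of the two documents, so all learned idf weights come out equal.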
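To score a new comment, use transform (not fit_transform) so that the vocabulary and idf weights learned from the training data are reused. Words outside that vocabulary are ignored during vectorization.
new_comment = ["ron don't like dramatic movie"]
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(tfidf_transformer.transform(count_vect.transform(new_comment)).toarray(), columns=count_vect.get_feature_names_out())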
dramatic horror john like movie movies ryan watches
0 0.57735 0.0 0.0 0.57735 0.57735 0.0 0.0 0.0
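The same can be tried on the test sentence from the question. This is only a sketch: it uses TfidfVectorizer, which combines the counting and tf-idf steps in one object, and the names train, test, columns and weights are my own. Since "ron", "likes" and "thriller" never occurred in the training sentences, only "movies" should get a nonzero weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["John like horror movie", "Ryan watches dramatic movies"]
test = ["Ron likes thriller movies"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train)                       # learn vocabulary and idf from the training text only

# column order follows the indices stored in vocabulary_
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

row = vectorizer.transform(test).toarray()[0]   # reuse what was learned; unseen words are dropped
weights = dict(zip(columns, row.tolist()))
print(weights)
```

There is no "thriller" column at all, because the columns are fixed at fit time.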
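If you want to use a specific vocabulary, prepare a list of the words you want, keep appending new words to that list, and pass the list to CountVectorizer:
vocabulary = ['dramatic', 'movie', 'horror']
# keep entries lowercase: CountVectorizer lowercases the text by default, so 'Thriller' would never match
vocabulary.append('thriller')
count_vect = CountVectorizer(vocabulary=vocabulary)
count_vect.fit_transform(text)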
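To see what a fixed vocabulary does to the test sentence from the question, here is a small sketch (train_counts, new_doc, new_counts and new_tfidf are names I made up). It assumes the default settings of CountVectorizer and TfidfTransformer. With a fixed vocabulary the columns follow the order of the list, and a word such as "thriller" gets a column even though no training sentence contains it.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]
vocabulary = ["dramatic", "movie", "horror", "thriller"]   # all lowercase

count_vect = CountVectorizer(vocabulary=vocabulary)
train_counts = count_vect.fit_transform(text)   # columns are exactly the words in the list
print(train_counts.toarray())

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(train_counts)             # idf is still learned from the training text

new_doc = ["Ron likes thriller movies"]
new_counts = count_vect.transform(new_doc)
new_tfidf = tfidf_transformer.transform(new_counts).toarray()
print(dict(zip(vocabulary, new_tfidf[0].tolist())))
```

The "thriller" column is now filled for the new document, but keep in mind that anything trained on these features has never seen a nonzero value in that column.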