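Hey, I tried three different approaches, but I can't tell which one is the correct way to use TF-IDF:
The first code calls fit_transform separately on x_train and x_test, giving (5000, 94462) and (5000, 93007) respectively.
The second code fits on the training and test data together, which I think is wrong, because the idf should be computed from the training documents only. It gives (5000, 152800) and (5000, 152800).
The third code gives (5000, 94462) and (5000, 94462).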
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.fit_transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train+x_test)
xtrain_tfidf = vectorizer.transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(x_train)
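x_train_vectorized = vect.transform(x_train)
x_test_vectorized = vect.transform(x_test)
print(x_train_vectorized.shape)
print(x_test_vectorized.shape)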
Answer 0 (score: 0)
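The correct way is to fit and transform (that is, fit_transform) your training data, and to only transform the test data. That way the vocabulary and the idf weights are learned from the training documents alone, and both matrices end up with the same columns.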
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
You never call fit_transform on the test data.
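To see why, here is a minimal sketch on a toy corpus. The documents and the names train_docs, test_docs, and refit are made up for illustration and are not from the question. It compares the fit-on-train approach with refitting on the test data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical data, for illustration only).
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["a bird sat on the fence", "the dog barked loudly"]

# Correct: learn the vocabulary and idf weights from the training data only.
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(train_docs)
xtest_tfidf = vectorizer.transform(test_docs)

# Both matrices use the training vocabulary, so the column counts match.
same_columns = xtrain_tfidf.shape[1] == xtest_tfidf.shape[1]

# Incorrect: refitting on the test data builds a different vocabulary,
# so the columns no longer line up with what a model was trained on.
refit = TfidfVectorizer()
xtest_refit = refit.fit_transform(test_docs)
vocab_differs = set(refit.vocabulary_) != set(vectorizer.vocabulary_)

print(xtrain_tfidf.shape, xtest_tfidf.shape, xtest_refit.shape)
print(same_columns, vocab_differs)
```

Words that appear only in the test documents are simply ignored by transform, which is the expected behavior: a model trained on the training matrix has no weights for them anyway.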