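In Tfidf.fit_transform we only pass the parameter X and do not use y to fit the dataset. Is this correct? We generate the tf-idf matrix only from the features of the training set, and ytrain is not used when fitting. How, then, do we make predictions on the test dataset?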
Answer 0 (score: 1)
https://datascience.stackexchange.com/a/12346/122 gives a good explanation of why the methods are called fit(), transform() and fit_transform().
In short:
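fit(): fits the vectorizer/model to the training data, so that the fitted vectorizer/model can be kept in a variable (returns a sklearn.feature_extraction.text.TfidfVectorizer).
transform(): uses the fitted vectorizer returned by fit() to transform the validation/test data (returns a scipy.sparse.csr.csr_matrix).
fit_transform(): sometimes you want to transform the training data directly, so fit() and transform() are combined into fit_transform() (returns a scipy.sparse.csr.csr_matrix).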
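The relationship between the three methods can be sketched as follows. This is a minimal illustration, not the code from the linked answer: the two documents and the `labels` list are made up, and `labels` is passed only to show that the vectorizer accepts a `y` argument and ignores it, since TF-IDF weighting is computed from the text alone.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus and targets, used only for this illustration.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "roses are red and chocolates are brown",
]
labels = [0, 1]

# One step: learn the vocabulary and IDF weights, then transform the same data.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: fit() returns the fitted vectorizer, transform() returns the matrix.
fitted = TfidfVectorizer().fit(docs, labels)  # y is accepted but not used
two_step = fitted.transform(docs)

# Both routes should produce the same TF-IDF matrix.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

Keeping the two-step form is useful when the fitted vectorizer has to be reused later on data it was not fitted on.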
For example:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix  # the scipy.sparse.csr module path is deprecated in recent SciPy
# The *TfidfVectorizer* from sklearn expects list of strings as input.
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower()
sent1 = "Mr brown jumps over the lazy fox .".lower()
sent2 = "Roses are red , the chocolates are brown .".lower()
sent3 = "The frank dog jumps through the red roses .".lower()
dataset = [sent0, sent1, sent2, sent3]
# Initialize the parameters of the vectorizer
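# `input` only accepts 'filename', 'file' or 'content', so the documents are
# passed to fit()/transform() instead of the constructor. min_df=1 is the
# default; recent scikit-learn versions reject an integer min_df of 0.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1),
                             min_df=1, stop_words=None)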
[out]:
# Learns the vocabulary of vectorizer based on the initialized parameter.
>>> vectorizer = vectorizer.fit(dataset)
# Apply the vectorizer to new sentence.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."])
<1x15 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
# Output to array form.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."]).toarray()
array([[0. , 0.31342551, 0. , 0.38714286, 0. ,
0. , 0.31342551, 0. , 0. , 0. ,
0. , 0. , 0.38714286, 0.51249178, 0.49104163]])
# When you don't need to save the vectorizer for re-using.
>>> vectorizer.fit_transform(dataset)
<4x15 sparse matrix of type '<class 'numpy.float64'>'
with 28 stored elements in Compressed Sparse Row format>
>>> vectorizer.fit_transform(dataset).toarray()
array([[0. , 0.49642852, 0. , 0.30659399, 0.30659399,
0. , 0.24821426, 0.30659399, 0. , 0.30659399,
0.38887561, 0. , 0. , 0.40586285, 0. ],
[0. , 0.32107915, 0. , 0. , 0.39659663,
0. , 0.32107915, 0.39659663, 0.50303254, 0.39659663,
0. , 0. , 0. , 0.26250325, 0. ],
[0.76012588, 0.24258925, 0.38006294, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.29964599, 0.29964599, 0.19833261, 0. ],
[0. , 0. , 0. , 0.34049544, 0. ,
0.4318753 , 0.27566041, 0. , 0. , 0. ,
0. , 0.34049544, 0.34049544, 0.45074089, 0.4318753 ]])
>>> type(vectorizer)
<class 'sklearn.feature_extraction.text.TfidfVectorizer'>
>>> type(vectorizer.fit_transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>
>>> type(vectorizer.transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>
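To connect this back to the question about ytrain and prediction: the labels are not needed by the vectorizer, they are needed by the model trained on top of the TF-IDF features. The sketch below is one possible workflow under stated assumptions, not part of the original answer. The labels in `y_train`, the test sentence and the choice of LogisticRegression as classifier are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; the labels are invented for illustration only.
X_train = [
    "the quick brown fox jumps over the lazy brown dog .",
    "mr brown jumps over the lazy fox .",
    "roses are red , the chocolates are brown .",
    "the frank dog jumps through the red roses .",
]
y_train = [0, 0, 1, 1]
X_test = ["the brown roses jumps through the chocolates ."]

vectorizer = TfidfVectorizer()

# Fit the vocabulary and IDF weights on the training text only; no y is needed.
X_train_tfidf = vectorizer.fit_transform(X_train)

# Reuse the fitted vectorizer on the test text; do not refit it here.
X_test_tfidf = vectorizer.transform(X_test)

# The labels enter at this point, when the classifier is fitted.
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)

y_pred = clf.predict(X_test_tfidf)
print(y_pred)
```

Calling transform() rather than fit_transform() on the test set keeps the test matrix in the same column space as the training matrix, which is what the classifier expects.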