Linear regression with a tf-idf transformation

Time: 2016-07-22 19:13:28

Tags: scikit-learn linear-regression tf-idf

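I have two data frames: the first has more than 700 predictor variables as columns, and the second has a single column. The first is used as the predictors (all values are 0 or 1, and mostly 0 because the data is sparse); the second is the response used to train and test the model. The first is named ser and the second is named star.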

I use the following for the tf-idf transformation:

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()

A = transformer.fit_transform(ser)

The following shows part of the output of print(A):

 (0, 302)   0.613133438876
 (0, 202)   0.789979358042
 (1, 556)   1.0
 (2, 556)   0.432375068194
 (2, 17)    0.901693850708
 (3, 556)   0.269567465847
 (3, 335)   0.671245025218
 (3, 256)   0.400099662956
 (3, 238)   0.562746618986
 (4, 556)   0.401348891903
 (4, 137)   0.915925251846
 (5, 641)   0.785485510985
 (5, 396)   0.618880046562
 (6, 317)   0.525163047715
 (6, 305)   0.851001629443
 ... (more are cut)

Am I using this tf-idf transformation correctly? I ask because the code below raises an error, which I have posted at the end.

import pandas as pd

star = pd.DataFrame({"star": star})
data = pd.concat([ser, star], axis = 1)

from sklearn.linear_model import LinearRegression

D = LinearRegression()

Dfit = D.fit(ser, star, sample_weight = A)
Dpred = D.predict(ser)
Dscore = D.score(ser,star)
print(Dscore)

Error:

Traceback (most recent call last):
  File "categories_model.py", line 67, in <module>
    Dfit = D.fit(ser, star, sample_weight = A)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/linear_model/base.py", line 434, in fit
    sample_weight=sample_weight)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/linear_model/base.py", line 127, in center_data
    X_mean = np.average(X, axis=0, weights=sample_weight)
  File "/opt/conda/lib/python2.7/site-packages/numpy/lib/function_base.py", line 937, in average
    "1D weights expected when shapes of a and weights differ.")
TypeError: 1D weights expected when shapes of a and weights differ.

Can anyone help me understand all of this and how to improve the code? Thanks!!

1 answer:

Answer 0 (score: 0)

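The error comes from passing the tf-idf matrix in the wrong place. sample_weight expects a 1D array with one weight per row, but A is the transformed feature matrix, so it should be passed as the predictors instead. This solves the problem: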

Dfit = D.fit(A, star)
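
For reference, here is a minimal end-to-end sketch of the same fix. The data is synthetic and only stands in for the question's ser and star (the sizes, the random seed and the 1 to 5 range for star are assumptions for illustration), so the printed score means nothing beyond showing that the calls run. Note that predict and score also take A rather than ser, since the model was fitted on the transformed features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Synthetic stand-ins for the question's data (assumed shapes and values):
# 50 rows, 20 sparse binary predictors, and an integer response from 1 to 5.
ser = pd.DataFrame((rng.rand(50, 20) < 0.2).astype(int))
star = pd.Series(rng.randint(1, 6, size=50), name="star")

# tf-idf reweights the 0/1 indicator matrix; the result is a sparse matrix
# with the same shape as the input.
transformer = TfidfTransformer()
A = transformer.fit_transform(ser.values)

# Pass the transformed matrix as the predictors, not as sample_weight.
D = LinearRegression()
D.fit(A, star)

# Use the same transformed features for prediction and scoring.
Dpred = D.predict(A)
Dscore = D.score(A, star)
print(A.shape, Dpred.shape, Dscore)
```

If per-row weights are really wanted, sample_weight can still be passed to fit, but it has to be a 1D array with one entry per row (length 50 in this sketch), not a matrix.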