Inconsistent numbers of samples with TF*IDF in scikit-learn

Date: 2018-04-03 22:43:51

Tags: python machine-learning scikit-learn classification text-classification

I have three lists of words, belonging respectively to athletes, comedians, and singers. I vectorized these three lists with scikit-learn using TF*IDF weighting to obtain the x_tfidf matrix below (my training data):

y = ['Athlete', 'Comedian', 'Singer']
x_tfidf = [[0.         0.         0.         0.         0.         0.01707793
  0.17077928 0.01707793 0.01707793 0.01707793 0.0129882  0.01707793
  0.         0.02597641 0.         0.         0.01707793 0.
  0.         0.06831171 0.         0.         0.0129882  0.03415586
  0.01707793 0.01707793 0.03415586 0.         0.01707793 0.
  0.0129882  0.         0.         0.         0.         0.
  0.01707793 0.01707793 0.         0.01707793 0.         0.01707793
  0.         0.         0.01707793 0.         0.         0.
  0.         0.         0.01707793 0.         0.0302595  0.
  0.01707793 0.         0.02597641 0.         0.         0.
  0.         0.03415586 0.01707793 0.55475746 0.01707793 0.
  0.         0.         0.         0.         0.01707793 0.
  0.         0.01707793 0.         0.         0.01707793 0.
  0.         0.03415586 0.06831171 0.01707793 0.         0.03415586
  0.         0.01707793 0.0129882  0.         0.         0.01707793
  0.05195282 0.02597641 0.020173   0.0129882  0.060519   0.02597641
  0.         0.01707793 0.         0.55475746 0.55475746 0.01707793
  0.         0.0302595  0.01707793 0.         0.         0.
  0.         0.01707793 0.         0.03415586 0.         0.
  0.         0.02597641 0.03415586 0.01707793 0.         0.05195282
  0.         0.         0.         0.         0.         0.
  0.03415586 0.         0.02597641 0.01707793 0.         0.
  0.         0.         0.0129882  0.         0.03415586 0.
  0.05123378]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.00791998 0.00791998 0.         0.
  0.         0.         0.         0.03167991 0.         0.01583996
  0.00602335 0.         0.00791998 0.         0.         0.
  0.         0.         0.         0.         0.00791998 0.
  0.         0.         0.         0.00602335 0.00791998 0.00602335
  0.00602335 0.00791998 0.         0.         0.014033   0.
  0.         0.01583996 0.         0.         0.         0.
  0.00791998 0.         0.         0.57535302 0.         0.
  0.         0.         0.         0.01807004 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.00791998 0.
  0.         0.         0.         0.00791998 0.         0.
  0.         0.         0.00467767 0.         0.00467767 0.
  0.00791998 0.         0.         0.57535302 0.57535302 0.
  0.         0.028066   0.         0.         0.01807004 0.01807004
  0.03167991 0.         0.03167991 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.00791998 0.         0.00602335
  0.         0.00791998 0.         0.         0.01807004 0.00791998
  0.         0.         0.         0.00791998 0.         0.
  0.        ]
 [0.00527285 0.00527285 0.00175762 0.01230331 0.01230331 0.
  0.         0.         0.         0.         0.00133671 0.
  0.05800134 0.31546417 0.00175762 0.00351523 0.         0.00175762
  0.00175762 0.         0.         0.         0.00133671 0.
  0.         0.         0.         0.         0.         0.
  0.         0.00175762 0.         0.00527285 0.00175762 0.00175762
  0.         0.         0.00175762 0.         0.         0.
  0.00175762 0.00527285 0.         0.00133671 0.         0.00133671
  0.00133671 0.         0.         0.00175762 0.00103808 0.00175762
  0.         0.         0.27268937 0.00351523 0.00351523 0.00175762
  0.         0.         0.         0.11937881 0.         0.0105457
  0.00527285 0.00175762 0.00175762 0.00133671 0.         0.00175762
  0.00175762 0.         0.02460663 0.00527285 0.         0.00175762
  0.00175762 0.         0.         0.         0.         0.
  0.00175762 0.         0.00401014 0.         0.00175762 0.
  0.01737726 0.29675019 0.21591993 0.00133671 0.22214839 0.31412746
  0.         0.         0.00175762 0.09654112 0.11937881 0.
  0.00351523 0.00207615 0.         0.00527285 0.00133671 0.00133671
  0.         0.         0.         0.         0.00351523 0.00175762
  0.00175762 0.00133671 0.         0.         0.00527285 0.63360177
  0.00175762 0.00703047 0.0105457  0.         0.00351523 0.00935699
  0.         0.         0.31412746 0.         0.00133671 0.
  0.00175762 0.00175762 0.00133671 0.         0.         0.0105457
  0.        ]]
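
For reference, a matrix like this is typically produced with scikit-learn's TfidfVectorizer. The following is only a minimal sketch with made-up placeholder documents, since the actual word lists and preprocessing are not shown in the question:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical word lists -- stand-ins for the question's three lists.
athlete_doc  = "goal match team coach season trophy"
comedian_doc = "joke stage standup laugh audience punchline"
singer_doc   = "song album vocal chorus tour record"

corpus = [athlete_doc, comedian_doc, singer_doc]
y = ['Athlete', 'Comedian', 'Singer']

vectorizer = TfidfVectorizer()
x_tfidf = vectorizer.fit_transform(corpus)   # sparse matrix of shape (3, n_features)
print(x_tfidf.shape)                         # one row per class document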

My goal is to test several classifiers so I can compare the output of different machine learning algorithms in scikit-learn; that is, to predict whether a user is an athlete, a comedian, or a singer based on a word list that will serve as test data. I tried KNN with the following code:

from sklearn import neighbors

def classify(x_tfidf, y):
    # fit a k-nearest-neighbours classifier on the TF*IDF matrix
    knn = neighbors.KNeighborsClassifier()
    knn.fit(x_tfidf, y)

However, I get the following error:

Traceback (most recent call last):
  File "bow.py", line 115, in <module>
    checkExists()
  File "bow.py", line 28, in checkExists
    get_tags(table)
  File "bow.py", line 34, in get_tags
    format_tags(data)
  File "bow.py", line 56, in format_tags
    vectorize(acc_list)
  File "bow.py", line 86, in vectorize
    classify(x_tag_tfidf, y)
  File "bow.py", line 95, in classify
    knn.fit(x_tag_tfidf, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 765, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 583, in check_X_y
    check_consistent_length(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 204, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 3]

I have tried changing "y" to a NumPy array and to a NumPy matrix, without success. If anyone could point me in the right direction, I would be very grateful.
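
(For reference, the two numbers in the error are the sample counts that scikit-learn sees for X and y. A minimal shape check one can run before fit, using a random stand-in for x_tfidf since the printed matrix above is not directly pasteable, would be:)

import numpy as np

y = ['Athlete', 'Comedian', 'Singer']
X = np.random.rand(3, 139)   # random stand-in with the shape x_tfidf should have

# check_X_y compares the number of rows in X with len(y);
# "inconsistent numbers of samples: [1, 3]" means it saw 1 row of X for 3 labels.
assert X.ndim == 2
assert X.shape[0] == len(y)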

1 Answer:

Answer 0 (score: 0)

I cannot reproduce your error, but I do get a different error when the number of training samples is smaller than the number of neighbors to be used (5 by default, as in your code).

Consider a randomly generated synthetic dataset with more data points, and notice that the code, exactly as you have it, works fine:

In [17]: y = ['Athlete', 'Comedian', 'Singer'] * 20

In [18]: x = np.random.rand(60, 139)

In [19]: knn = neighbors.KNeighborsClassifier()

In [20]: knn.fit(x, y)
Out[20]: 
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [21]: knn.predict(np.random.rand(1, 139))
Out[21]: 
array(['Athlete'],
      dtype='<U8')

In [22]: knn.predict(np.random.rand(1, 139))
Out[22]: 
array(['Athlete'],
      dtype='<U8')

In [23]: knn.predict(np.random.rand(1, 139))
Out[23]: 
array(['Singer'],
      dtype='<U8')

Now notice that if I reduce the toy data down to 3 samples, I do get an error:

In [25]: x = np.random.rand(3, 139)

In [26]: y = ['Athlete', 'Comedian', 'Singer']

In [27]: knn = neighbors.KNeighborsClassifier()

In [28]: knn.fit(x, y)
Out[28]: 
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [29]: knn.predict(np.random.rand(1, 139))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-e5a6eff3cd22> in <module>()
----> 1 knn.predict(np.random.rand(1, 139))

~/anaconda/envs/py36-keras/lib/python3.6/site-packages/sklearn/neighbors/classification.py in predict(self, X)
    143         X = check_array(X, accept_sparse='csr')
    144 
--> 145         neigh_dist, neigh_ind = self.kneighbors(X)
    146 
    147         classes_ = self.classes_

~/anaconda/envs/py36-keras/lib/python3.6/site-packages/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
    345                 "Expected n_neighbors <= n_samples, "
    346                 " but n_samples = %d, n_neighbors = %d" %
--> 347                 (train_size, n_neighbors)
    348             )
    349         n_samples, _ = X.shape

ValueError: Expected n_neighbors <= n_samples,  but n_samples = 3, n_neighbors = 5

If I set the required number of neighbors manually (3 here), it works:

In [31]: knn = neighbors.KNeighborsClassifier(n_neighbors=3)

In [32]: knn.fit(x, y)
Out[32]: 
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [33]: knn.predict(np.random.rand(1, 139))
Out[33]: 
array(['Athlete'],
      dtype='<U8')

Finally, if you change x in my examples from a numpy ndarray to a list of lists via x.tolist(), everything works the same, so the problem is unrelated to whether you use lists or ndarrays for x or y.
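
A minimal sketch of that last check, reusing the same 3-sample synthetic data as above:

import numpy as np
from sklearn import neighbors

x = np.random.rand(3, 139)
y = ['Athlete', 'Comedian', 'Singer']

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(x.tolist(), y)                        # plain list of lists works like an ndarray
print(knn.predict(np.random.rand(1, 139).tolist()))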