ValueError: Found arrays with inconsistent numbers of samples: [4 16149]

Time: 2016-05-17 00:31:11

Tags: python scikit-learn classification data-science

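Hi, I'm new to scikit-learn and to data science in general. I run into the error above when I try to retrieve the most informative features from my vectorizer. My code (edited to reflect @Jang's comment):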

values = dataset.data
word_vectorizer = CountVectorizer(analyzer='word', stop_words=custom_stop_words)
trainset = word_vectorizer.fit_transform(values)
tags = ['dem','rep','dem','rep']
tags = np.array(tags)
trainset = trainset.toarray()

word_svm = svm.LinearSVC()
word_svm.fit(trainset, tags)


def most_informative_feature_for_binary_classification(vectorizer, classifier, n=10):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    topn_class1 = sorted(zip(classifier.coef_[0], feature_names))[:n]
    topn_class2 = sorted(zip(classifier.coef_[0], feature_names))[-n:]

    for coef, feat in topn_class1:
        print class_labels[0], coef, feat

    print

    for coef, feat in reversed(topn_class2):
        print class_labels[1], coef, feat


most_informative_feature_for_binary_classification(word_vectorizer, word_svm)

Terminal output:

Traceback (most recent call last):
  File "classification.py", line 251, in <module>
    word_svm.fit(trainset, tags)
  File "/usr/local/lib/python2.7/site-packages/sklearn/svm/classes.py", line 205, in fit
    dtype=np.float64, order="C")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 520, in check_X_y
    check_consistent_length(X, y)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [    4 16149]

I'd appreciate any and all help with this. If I haven't provided enough information, please let me know. Thanks in advance for your time!

1 Answer:

Answer 0: (score: 0)

This is where it fails. Both arguments should be of the same type, an array:

word_svm.fit(trainset, tags)

tags is not an array and should be converted to one:

tags = ['dem','rep','dem','rep']

You can use print to check whether they are of the same type:

print type(tags)
print type(trainset)
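
Since the error message reports sample counts ([4 16149]) rather than types, it may also help to compare how many samples each argument holds. What follows is only a minimal sketch, assuming the arguments are either plain lists or objects with a shape attribute (such as NumPy arrays). The helpers n_samples and check_same_n_samples are hypothetical names, not part of scikit-learn, and the toy data merely stands in for trainset and tags:

```python
def n_samples(obj):
    """Return the number of samples (rows) in a list or array-like."""
    # Prefer .shape when it exists; scipy sparse matrices, for example,
    # do not support len().
    shape = getattr(obj, 'shape', None)
    if shape is not None and len(shape) > 0:
        return shape[0]
    return len(obj)


def check_same_n_samples(X, y):
    """Raise ValueError if X and y hold a different number of samples."""
    nx, ny = n_samples(X), n_samples(y)
    if nx != ny:
        raise ValueError('X has %d samples but y has %d labels' % (nx, ny))
    return nx


# Toy data (assumed, not the real dataset): three rows but only two labels.
rows = [[1, 0, 2], [0, 1, 0], [3, 0, 0]]
labels = ['dem', 'rep']
try:
    check_same_n_samples(rows, labels)
except ValueError as err:
    print(err)
```

If the two counts differ, the fix is probably in how the labels are built (one label per document) rather than in their type.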

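The code below was written in a text editor and has not been run, so it is not guaranteed to work, but you get the idea. I may be wrong about converting to an array; a list is fine.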

Your trainset probably contains invalid data. Replace

word_svm.fit(trainset, tags)

with this:

import sys

trainset_good, trainset_bad = trainset_check(trainset, tags)
print 'Bad data\n'
print trainset_bad
if len(trainset_good) == 0:
    print 'No good valid data found, exit'
    sys.exit(1)

# use good data
word_svm.fit(trainset_good, tags)

Add this function to your code:

def trainset_check(trainset, tags):
    # Split the rows of trainset into usable rows and rejected rows.
    trainset_good = []
    trainset_bad = []
    # Use len() here: "not trainset" raises ValueError on a NumPy array
    # that has more than one element.
    if trainset is None or len(trainset) == 0:
        print 'Err trainset is empty'
        return trainset_good, trainset_bad
    if tags is None or len(tags) == 0:
        print 'Err tags empty'
        return trainset_good, trainset_bad
    for item in trainset:
        if len(item) != len(tags):
            print 'Error - trainset item is not the same length as tags'
            print item
            trainset_bad.append(item)
            # skip to next
            continue
        # filter(None, ...) drops every falsy entry (None, but also 0)
        item_new = filter(None, item)
        if len(item_new) != len(tags):
            print 'Error - trainset item is not the same length as tags'
            # bad trainset data, skip to next
            print item
            trainset_bad.append(item)
            continue
        trainset_good.append(item)
    return trainset_good, trainset_bad
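
One thing to watch with this kind of filtering: if rows are dropped from the trainset but tags is passed to fit unchanged, the two can end up with different lengths again. Here is a rough sketch of filtering both together, assuming there is exactly one tag per row. The names split_valid and is_valid are hypothetical and used only for illustration, and the validity rule shown (no None entries) is just an example:

```python
def split_valid(rows, tags, is_valid):
    """Split (row, tag) pairs into kept rows, kept tags, and rejected rows."""
    if len(rows) != len(tags):
        raise ValueError('%d rows but %d tags' % (len(rows), len(tags)))
    good_rows, good_tags, bad_rows = [], [], []
    for row, tag in zip(rows, tags):
        if is_valid(row):
            good_rows.append(row)
            good_tags.append(tag)  # keep the tag in step with its row
        else:
            bad_rows.append(row)
    return good_rows, good_tags, bad_rows


# Toy data (assumed): a row counts as valid here if it has no None entries.
rows = [[1, 0], [None, 2], [0, 3], [4, 4]]
tags = ['dem', 'rep', 'dem', 'rep']
good_rows, good_tags, bad_rows = split_valid(
    rows, tags, lambda row: all(v is not None for v in row))
print(good_rows)
print(good_tags)
print(bad_rows)
```

With this shape, good_rows and good_tags always have the same length, so they can be handed to fit together.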