Bad input shape error with chi-squared SelectKBest

Time: 2016-08-17 07:12:26

Tags: python-2.7 scikit-learn chi-squared

I am fairly new to scikit-learn and ML. I am trying to train an AdaBoost classifier for one-vs-rest classification, using the following code:

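import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Read the training data set (processText is a helper that is not shown in the post)
train = pd.read_csv("train.csv", header=0, delimiter=",",
                    quoting=1, error_bad_lines=False)
num_reviews = len(train["text"])
clean_train_reviews = []
catlist = []
for i in xrange(0, num_reviews):
    data = processText(train["text"][i])
    data1 = train["category"][i]
    clean_train_reviews.append(data)
    # Categories are '.'-separated, so each sample gets a list of labels
    catlist.append(data1.split('.'))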

# To read test dataset
test = pd.read_csv("test.csv", header=0, delimiter=",", \
                   quoting=1, error_bad_lines=False)
num_reviews = len(test["text"])
clean_test_reviews = [] 
for i in xrange(0,num_reviews):
    data=processText(test["text"][i])
    clean_test_reviews.append(data)
X_test=np.array(clean_test_reviews)


lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(catlist)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2), max_features=1500,min_df=4)),
    ('tfidf', TfidfTransformer()),
    ('chi2', SelectKBest(chi2, k=200)),
    ('clf', OneVsRestClassifier(AdaBoostClassifier()))])
classifier.fit(clean_train_reviews, Y)
predicted = classifier.predict(X_test)

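I use a pipeline in which the text is passed in as clean_train_reviews and Y holds the classes (multi-label, N = 10). Text features are extracted in the pipeline with CountVectorizer and TfidfTransformer, and then selected with the chi-squared feature selection method. Fitting the classifier gives: ValueError: bad input shape (1000, 10)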

 File "<ipython-input-10-9dbc8b18e6b8>", line 1, in <module>
    runfile('C:/Users/Administrator/Desktop/nincymiss/adaboost.py', wdir='C:/Users/Administrator/Desktop/nincymiss')

  File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 601, in runfile
    execfile(filename, namespace)

  File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 66, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Users/Administrator/Desktop/nincymiss/adaboost.py", line 179, in <module>
    classifier.fit(clean_train_reviews, Y)

  File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 164, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)

  File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 145, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])

  File "C:\Python27\lib\site-packages\sklearn\base.py", line 458, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)

  File "C:\Python27\lib\site-packages\sklearn\feature_selection\univariate_selection.py", line 322, in fit
    X, y = check_X_y(X, y, ['csr', 'csc'])

  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 515, in check_X_y
    y = column_or_1d(y, warn=True)

  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 551, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))

ValueError: bad input shape (1000, 10)
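
For context on the shape in the error message: MultiLabelBinarizer turns a list of label lists into a 2-D indicator matrix with one column per label, so 1000 training rows with 10 categories give a Y of shape (1000, 10). The traceback shows column_or_1d, called from SelectKBest.fit, rejecting that 2-D target. Below is a minimal sketch of how such a matrix comes about. The label names are made up for illustration and do not come from the original data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists, shaped like catlist in the question
catlist = [["sports"], ["politics", "economy"], ["sports", "economy"], ["politics"]]

lb = MultiLabelBinarizer()
Y = lb.fit_transform(catlist)

print(lb.classes_)    # label names, one per column of Y
print(Y.shape)        # (n_samples, n_labels): a 2-D target
print(Y[:, 0].shape)  # a single label column is 1-D
```

Each column of Y is a 1-D binary target of the kind a single-label feature selector can work with.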

1 answer:

Answer 0 (score: 0)

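This is because feature selection does not work the way you expect for a multi-label problem. In the scikit-learn version shown in the traceback, SelectKBest.fit passes y through column_or_1d, so it needs a 1-D target, while your pipeline hands it the full (1000, 10) indicator matrix. You can try the following, which wraps the whole pipeline in OneVsRestClassifier so that the 'best' features are selected for each label separately.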

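# One pipeline per label: vectorize, weight, select features, classify
classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2), max_features=1500, min_df=4)),
    ('tfidf', TfidfTransformer()),
    ('chi2', SelectKBest(chi2, k=200)),
    ('clf', AdaBoostClassifier())])

# OneVsRestClassifier fits a clone of the pipeline on each column of Y
clf = OneVsRestClassifier(classifier)
clf.fit(clean_train_reviews, Y)
predicted = clf.predict(X_test)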
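
To try the idea without the original CSV files, here is a self-contained sketch on a tiny invented corpus. The documents, the label names, and the small settings (default min_df, k=5) are assumptions chosen so the toy data fits. They are not the values from the question.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy corpus; every label has both positive and negative examples
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "parliament passed the new budget law",
    "the minister announced a tax reform",
    "the football club signed a sponsorship deal worth millions",
    "stock markets fell after the budget vote",
    "the goalkeeper saved a penalty in the match",
    "investors expect tax cuts to lift markets",
]
train_labels = [
    ["sports"], ["sports"], ["politics"], ["politics"],
    ["sports", "economy"], ["politics", "economy"], ["sports"], ["economy"],
]
test_docs = ["the club won the match", "markets react to the tax law"]

lb = MultiLabelBinarizer()
Y = lb.fit_transform(train_labels)  # shape (8, 3)

# The whole pipeline is the base estimator, so chi2 only ever sees
# one 1-D binary label column at a time.
per_label_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("chi2", SelectKBest(chi2, k=5)),  # small k because the toy vocabulary is small
    ("clf", AdaBoostClassifier()),
])

clf = OneVsRestClassifier(per_label_pipeline)
clf.fit(train_docs, Y)

predicted = clf.predict(test_docs)        # indicator matrix, one column per label
print(predicted.shape)
print(lb.inverse_transform(predicted))    # back to tuples of label names
```

Because each label gets its own fitted copy of the pipeline, each label can end up with a different set of selected features. The cost is that vectorization is repeated once per label.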