ValueError: Found arrays with inconsistent numbers of samples [6 1786]

Time: 2016-02-13 11:18:40

Tags: python machine-learning scikit-learn text-analysis

Here is my code:

$EndPoint='https://api.sandbox.ebay.com/ws/api.dll';

$header= array(
    'Content-Type: text/xml',
    'X-EBAY-API-COMPATIBILITY-LEVEL: 921',
    'X-EBAY-API-DEV-NAME: ' . $devId,
    'X-EBAY-API-APP-NAME: ' . $appId,
    'X-EBAY-API-CERT-NAME: ' . $certId,
    'X-EBAY-API-CALL-NAME: ' . 'GeteBayTime',
    'X-EBAY-API-SITEID: ' . '101',
    'X-EBAY-API-REQUEST-ENCODING:XML'
);

$xml='<?xml version="1.0" encoding="utf-8"?>
<GeteBayTimeRequest xmlns="urn:ebay:apis:eBLBaseComponents">
</GeteBayTimeRequest>';

$connection = curl_init();
curl_setopt($connection, CURLOPT_URL, $EndPoint);
curl_setopt($connection, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($connection, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($connection, CURLOPT_HTTPHEADER, $header);
curl_setopt($connection, CURLOPT_POST, 1);
curl_setopt($connection, CURLOPT_POSTFIELDS, $xml);
curl_setopt($connection, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($connection);
curl_close($connection);
echo $response;

I get an error and I don't understand why. Traceback:

  

Traceback (most recent call last):
  File "C:/Users/Roman/PycharmProjects/week_3/assignment_2.py", line 23, in <module>
    gs.fit(X, y_scaled)  # TODO: check this line
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\grid_search.py", line 525, in _fit
    X, y = indexable(X, y)
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [6 1786]

Can someone explain why this error occurs?

1 Answer:

Answer 0: (score: 2)

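I think you are a little confused about X and y here. The error says that the two arrays passed to fit have different numbers of samples (6 versus 1786), so the first argument is not a matrix with one row per document. You want to transform X into tf-idf vectors and then train on those vectors against y. See below.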

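import numpy as np
from sklearn import datasets
# Note: sklearn.grid_search and sklearn.cross_validation were deprecated in
# scikit-learn 0.18 and removed in 0.20; newer releases provide GridSearchCV
# and KFold in sklearn.model_selection instead.
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

newsgroups = datasets.fetch_20newsgroups(
    subset='all',
    categories=['alt.atheism', 'sci.space']
)
X = newsgroups.data    # list of raw documents
y = newsgroups.target  # one class label per document

# Turn the raw text into a sparse tf-idf matrix with one row per document.
TD_IF = TfidfVectorizer()
X_scaled = TD_IF.fit_transform(X)

grid = {'C': np.power(10.0, np.arange(-1, 1))}
# The fold generator needs the number of samples, which is y.size.
cv = KFold(y.size, n_folds=5, shuffle=True, random_state=241)
clf = SVC(kernel='linear', random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring='accuracy', cv=cv)
# X_scaled and y have the same number of samples, so fit() accepts them.
gs.fit(X_scaled, y)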
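
For anyone on a newer scikit-learn release, here is a rough sketch of the same idea using `sklearn.model_selection`. It is not the asker's script: the tiny `texts` corpus and its `labels` are made up so the example runs without downloading the 20 newsgroups data, and `n_splits=3` is chosen only because the toy set is so small. It also passes a deliberately truncated matrix to `fit` once, to show that mismatched sample counts raise a `ValueError`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for newsgroups.data (assumption).
texts = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the shuttle docked with the station",
    "astronauts repaired the telescope",
    "the probe flew past the moon",
    "a telescope observed the distant galaxy",
    "the debate was about religion and belief",
    "atheism rejects belief in gods",
    "the argument concerned faith and reason",
    "she questioned the existence of god",
    "religion and morality were discussed",
    "he wrote an essay about belief and doubt",
]
# Hypothetical labels standing in for newsgroups.target: 1 = space, 0 = atheism.
labels = np.array([1] * 6 + [0] * 6)

# One tf-idf row per document, so the row count matches the label count.
X_tfidf = TfidfVectorizer().fit_transform(texts)

grid = {"C": np.power(10.0, np.arange(-1, 1))}
cv = KFold(n_splits=3, shuffle=True, random_state=241)
clf = SVC(kernel="linear", random_state=241)
gs = GridSearchCV(estimator=clf, param_grid=grid, scoring="accuracy", cv=cv)

# Passing 6 rows against 12 labels reproduces the kind of mismatch in the question.
mismatch_raised = False
try:
    gs.fit(X_tfidf[:6], labels)
except ValueError as exc:
    mismatch_raised = True
    print("fit rejected mismatched inputs:", exc)

# With matching sample counts the search runs normally.
gs.fit(X_tfidf, labels)
print("best parameters:", gs.best_params_)
```

Comparing `X_tfidf.shape[0]` with `labels.size` before calling `fit` is a quick way to catch this kind of mismatch early.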