How to split a dataset - Number of labels=150 does not match number of samples=600

Time: 2017-04-08 22:12:26

Tags: python pandas machine-learning scikit-learn decision-tree

I have a data sample of size 750x256.

If I split the data with a 20% test size, I get 600 samples in X_train and 150 samples in y_train.

The problem then occurs when running decisionTreeRegressor.

It says: Number of y_train=150 does not match number of samples=600

But if I set test_size to 50%, it works. Is there a way to fix this? I don't want to use a test_size of 50%.

Any help would be great!

Here is my code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz

#Load the data
dataset = pd.read_csv('new_york.csv')
dataset['Higher'] = dataset['2016-12'].gt(dataset['2016-11']).astype(int)
X = dataset.iloc[:, 6:254].values
y = dataset.iloc[:, 255].values

#Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, :248])
X[:, :248] = imputer.transform(X[:, :248])

#Split the data into train and test sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_test, y_train = train_test_split(X, y, test_size = .2, random_state = 0)

#let's build our first model
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier(max_depth=6)
clf.fit(X_train, y_train)
clf.score(X_train, y_train)

1 answer:

Answer 0 (score: 1)

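train_test_split() returns X_train, X_test, y_train, y_test, in that order. Your code unpacks y_test and y_train the other way round, so the variable named y_train actually holds the 150 test labels while X_train holds the 600 training rows.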
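As a minimal sketch of the corrected unpacking, the snippet below uses synthetic data in place of new_york.csv (the random array, the toy labels, and the name test_accuracy are assumptions for illustration, not part of the question). It imports train_test_split from sklearn.model_selection, which is where the function lives in current scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data (assumption): 750 rows, 248 features.
rng = np.random.RandomState(0)
X = rng.rand(750, 248)
y = (X[:, 0] > 0.5).astype(int)  # toy binary labels, for illustration only

# Unpack in the order the function returns its outputs:
# X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, y_train.shape)  # (600, 248) (600,)
print(X_test.shape, y_test.shape)    # (150, 248) (150,)

# Features and labels now have matching lengths, so fit() accepts them.
clf = DecisionTreeClassifier(max_depth=6, random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)  # score on held-out data
```

Scoring on X_test and y_test rather than on the training data, as the question does, gives a less optimistic picture of how the tree generalizes.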

With a 50% split no error is raised, because y_train and y_test then have the same size. The model would still be trained on the wrong labels, though.
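The silent failure at 50% can be illustrated with a small sketch. The data here is hypothetical: each row's single feature equals its label, which makes it easy to check whether features and labels still line up after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i has feature value i and label i.
X = np.arange(750).reshape(-1, 1)
y = np.arange(750)

# Deliberately swapped unpacking, as in the question, with a 50% split.
X_train, X_test, y_test, y_train = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Both halves have 375 rows, so no length check can catch the mistake.
same_length = len(X_train) == len(y_train)

# The labels in "y_train" belong to the rows of X_test, not X_train.
aligned = np.array_equal(X_train[:, 0], y_train)

print(same_length)  # True
print(aligned)      # False
```

Because the lengths agree, fit() would run without complaint and the tree would learn from labels that belong to other rows.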