My data sample is 750x256.
If I split the data with a 20% test size, I get 600 samples in X_train and 150 samples in y_train.
Then, when I run decisionTreeRegressor, it says: Number of y_train=150 does not match number of samples=600
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
#Load the data
dataset = pd.read_csv('new_york.csv')
dataset['Higher'] = dataset['2016-12'].gt(dataset['2016-11']).astype(int)
X = dataset.iloc[:, 6:254].values
y = dataset.iloc[:, 255].values
#Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, :248])
X[:, :248] = imputer.transform(X[:, :248])
#Split the data into train and test sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_test, y_train = train_test_split(X, y, test_size = .2, random_state = 0)
#let's build our first model
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier(max_depth=6)
clf.fit(X_train, y_train)
clf.score(X_train, y_train)
But if I set test_size to 50%, it works. Is there a way to fix this? I don't want to use a test_size of 50%.
Any help would be great!
Answer 0 (score: 1)
train_test_split() returns X_train, X_test, y_train, y_test; you have y_train and y_test in the wrong order.
With a 50% split this does not raise an error because y_train and y_test have the same size (but, obviously, the wrong values).
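A minimal sketch of the corrected unpacking, in case it helps. It uses synthetic data as a stand-in for new_york.csv (the random array and the label rule are assumptions for illustration only), and it imports train_test_split from sklearn.model_selection, which is where the function lives in current scikit-learn releases; the sklearn.cross_validation module from the question was removed in later versions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the question's data: 750 samples, 248 features.
# (Assumption for illustration; the real data comes from new_york.csv.)
rng = np.random.RandomState(0)
X = rng.rand(750, 248)
y = (X[:, 0] > 0.5).astype(int)

# Correct unpacking order: both X splits first, then both y splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, y_train.shape)  # (600, 248) (600,)
print(X_test.shape, y_test.shape)    # (150, 248) (150,)

# Features and labels now have matching lengths, so fit() succeeds.
clf = DecisionTreeClassifier(max_depth=6, random_state=0)
clf.fit(X_train, y_train)
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
```

Printing the shapes right after the split is a cheap way to catch this kind of mix-up before fitting, including in the 50% case where no error is raised.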