Question

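I'm trying to write my own train/test split function using numpy instead of using sklearn's train_test_split. I'm splitting the data into 70% training and 30% testing. I'm using the Boston housing dataset from sklearn.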

This is the shape of the data:

housing_features.shape #(506,13) where 506 is sample size and it has 13 features.

This is my code:

import numpy as np
from sklearn import datasets

city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data

def shuffle_split_data(X, y):
    split = np.random.rand(X.shape[0]) < 0.7

    X_Train = X[split]
    y_Train = y[split]
    X_Test =  X[~split]
    y_Test = y[~split]

    print len(X_Train), len(y_Train), len(X_Test), len(y_Test)
    return X_Train, y_Train, X_Test, y_Test

try:
    X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
    print "Successful"
except:
    print "Fail"

The printed output I get is:

362 362 144 144
"Successful"

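But I know it isn't actually working, because when I run it again I get different lengths, whereas sklearn's train_test_split always gives me 354 for the length of X_train.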

#correct output
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print len(X_train) 
#354

What am I missing in my function?

Answer 1

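That's because np.random.rand gives you independent random numbers, so the fraction that falls below the 0.7 threshold is only approximately 70%; it gets close to exactly 70% only for very large sample sizes. You can use np.percentile to get the 70th-percentile value of the random numbers and compare against that instead: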

def shuffle_split_data(X, y):
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 70)

    X_train = X[split]
    y_train = y[split]
    X_test =  X[~split]
    y_test = y[~split]

    print len(X_train), len(y_train), len(X_test), len(y_test)
    return X_train, y_train, X_test, y_test
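
To see the difference between the two thresholds without loading the dataset, here is a minimal sketch. It assumes only the sample count from the question (506); the seed and the use of np.random.default_rng are my own choices for the illustration, and the code is written for Python 3, unlike the Python 2 snippets above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this illustration
n_samples = 506                 # same number of rows as the question's data

# Fixed threshold 0.7: each row is kept independently with probability 0.7,
# so the number of selected rows changes from draw to draw.
threshold_counts = [int((rng.random(n_samples) < 0.7).sum()) for _ in range(5)]

# Threshold at the 70th percentile of the draws themselves:
# the number of values below it is the same on every draw.
percentile_counts = []
for _ in range(5):
    draws = rng.random(n_samples)
    percentile_counts.append(int((draws < np.percentile(draws, 70)).sum()))

print(threshold_counts)   # fluctuates around 0.7 * 506
print(percentile_counts)  # 354 on every draw
```

The first list is a set of binomial draws, which is why the lengths in the question change between runs; the second is pinned by the percentile.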

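Edit

Alternatively, you can use np.random.choice to select the required number of indices. Pass replace=False so that no row is picked more than once. In your case:

np.random.choice(X.shape[0], int(0.7*X.shape[0]), replace=False)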
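
Putting the index-based idea into a full function could look roughly like the sketch below. It uses a random permutation of the row indices rather than np.random.choice, which gives the complementary test indices for free. The train_fraction and seed parameters and the synthetic stand-in arrays are assumptions added for illustration (Python 3), not part of the original code.

```python
import numpy as np

def shuffle_split_data(X, y, train_fraction=0.7, seed=None):
    """Split X and y into train/test parts with a fixed number of training rows."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical convenience parameter
    n_samples = X.shape[0]
    n_train = int(train_fraction * n_samples)

    # Shuffle the row indices once; every row ends up in exactly one part.
    indices = rng.permutation(n_samples)
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Synthetic stand-in with the same shape as the question's data (506 rows, 13 features).
X = np.arange(506 * 13, dtype=float).reshape(506, 13)
y = np.arange(506, dtype=float)

X_train, y_train, X_test, y_test = shuffle_split_data(X, y, seed=42)
print(len(X_train), len(y_train), len(X_test), len(y_test))  # 354 354 152 152
```

Passing the same seed gives the same split on every run, but the selected rows will generally not be the same ones sklearn picks for random_state=42; only the sizes match.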
