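I want to create training and test datasets manually in pandas instead of using sklearn's cross-validation. I'm almost there. However, I found a discrepancy between the row counts of df_training and df_test. Why is that? While the sizes of df and df_training stay constant, I can't get the expected size for df_test.
Here is what I did: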
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
names = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']
df = pd.DataFrame(boston.data, columns=names)
# add in prices
df['price'] = boston.target
df.shape
(506, 14)
# Use 70% of the DataFrame and call it df_training
df_training = df.loc[np.random.choice(df.index, 354)]
df_training.shape
# Remove the 70% of data from the main DataFrame and call it df_test
df_test = df.drop(df_training.index)
df_test.shape
(250, 14)
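Shouldn't I get 506 - 354 = 152?
Interestingly, when I run the whole script several times, I get a different test-set size each time. Since the training set and the original set keep the same size, shouldn't I get the same result every time? What is going on?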
In [26]: %run create_training.py
Original Set: (506, 14)
training set: (354, 14)
test set: (247, 14)
In [27]: %run create_training.py
Original Set: (506, 14)
training set: (354, 14)
test set: (254, 14)
In [28]: %run create_training.py
Original Set: (506, 14)
training set: (354, 14)
test set: (241, 14)
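Answer 0 (score: 0)
I think the two ingredients missing here are:
1. Setting the seed for the numpy random functions so that the split is reproducible.
2. Calling np.random.choice with replace=False (refer to the docs for more information). By default it samples with replacement, so the 354 draws contain repeated index labels. df.drop removes each label only once, which is why df_test ends up with more than 152 rows and why its size changes from run to run.
Code: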
# make results reproducible
np.random.seed(42)
# sample without replacement
train_ix = np.random.choice(df.index, 354, replace=False)
df_training = df.loc[train_ix]
df_test = df.drop(train_ix)
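To see the effect without depending on the Boston dataset, here is a minimal sketch on a synthetic 506-row DataFrame. The stand-in data, the `RandomState(42)` generator, and the variable names are assumptions made for illustration only. It compares sampling with and without replacement, and also shows `DataFrame.sample` as one possible alternative for drawing the training rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 506-row DataFrame (assumption: only the row
# count and the default integer index matter for this demonstration).
df = pd.DataFrame({"price": np.arange(506, dtype=float)})

# A local generator keeps the example reproducible without touching the
# global numpy seed (illustrative choice, not part of the original answer).
rng = np.random.RandomState(42)

# WITH replacement (the default): the same label can be drawn several times.
with_repl = rng.choice(df.index, 354)
n_unique = len(np.unique(with_repl))
test_with_repl = df.drop(with_repl)  # each label is dropped only once
print("unique labels drawn:", n_unique)
print("test rows (with replacement):", len(test_with_repl))  # 506 - n_unique

# WITHOUT replacement: 354 distinct labels, so the split is exact.
train_ix = rng.choice(df.index, 354, replace=False)
df_training = df.loc[train_ix]
df_test = df.drop(train_ix)
print(df_training.shape, df_test.shape)

# Alternative: DataFrame.sample draws without replacement by default.
df_training2 = df.sample(n=354, random_state=42)
df_test2 = df.drop(df_training2.index)
```

With replacement, the test set has 506 minus the number of unique labels drawn, not 506 minus 354. That explains both the oversized test set and its run-to-run variation in the question.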