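I have a fairly large dataset in the form of a DataFrame, and I would like to know how to split it into two random samples (80% and 20%) for training and testing.
Thanks!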
Answer 0 (score: 467)
scikit-learn's train_test_split is a good one.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
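If you need the same split on every run, train_test_split accepts a random_state seed. A minimal sketch, assuming a small made-up frame in place of the real dataset (the column names and the seed value are only for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real dataset.
df = pd.DataFrame({"feature": range(100), "label": [i % 2 for i in range(100)]})

# random_state fixes the shuffle, so the same rows land in each part on every run.
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(len(train), len(test))
```

Both parts come back as DataFrames and keep their original index labels.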
Answer 1 (score: 257)
I would just use numpy's random functions (randn for the demo data, rand for the mask):
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to check that this worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
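Note that the mask gives only an approximate 80/20 split, since each row is assigned independently. A small sketch of the same idea with a seeded generator (the seed and the use of np.random.default_rng are my additions, not part of the answer), showing that the two parts still partition the frame exactly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, added here for reproducibility
df = pd.DataFrame(rng.standard_normal((100, 2)))

# Each row independently goes to train with probability 0.8.
msk = rng.random(len(df)) < 0.8
train, test = df[msk], df[~msk]

# The sizes vary from seed to seed, but every row ends up in exactly one part.
print(len(train), len(test))
```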
Answer 2 (score: 215)
pandas' random sample will also work:
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
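One caveat: df.drop(train.index) removes rows by index label, so it assumes the labels are unique. A minimal sketch, assuming a hypothetical frame built by concatenation where labels repeat; resetting the index first keeps the split a clean partition:

```python
import pandas as pd

# Hypothetical frame with repeated index labels, e.g. after concatenating two files.
df = pd.concat([pd.DataFrame({"x": range(5)}), pd.DataFrame({"x": range(5, 10)})])

# Make the labels unique before sampling, otherwise drop() would remove
# every row that shares a label with a sampled row.
df = df.reset_index(drop=True)

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
```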
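Answer 3 (score: 25)
I would use scikit-learn's own train_test_split and generate the split from the index.
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
# split the index labels rather than the rows themselves
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # returns the training DataFrame (these are labels, so .loc rather than .iloc)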
Answer 4 (score: 13)
There is no need to convert to numpy. Just pass a pandas df to the split and it will return pandas dfs.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
And if you want to split x from y:
X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)
Answer 5 (score: 11)
You can use the code below to create test and train samples:
from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)
The test size can vary depending on the percentage of data you want to put in your test and train datasets.
Answer 6 (score: 6)
There are many valid answers; here is one more. It needs only pandas, so no scikit-learn import is required.
# get a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
# get the left-out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]
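Answer 7 (score: 5)
You may also consider a stratified division into training and testing sets. A stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.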
import numpy as np
def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and
    testing sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))
        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True
    return train_inds, test_inds
df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
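Answer 8 (score: 4)
You can use ~ (the tilde operator) to exclude the rows sampled with df.sample(), letting pandas alone handle the sampling and the index filtering, to obtain the two sets.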
train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
Answer 9 (score: 3)
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]
Answer 10 (score: 3)
import pandas as pd
from sklearn.model_selection import train_test_split
datafile_name = 'path_to_data_file'
data = pd.read_csv(datafile_name)
target_attribute = data['column_name']
# test_size=0.2 keeps 80% of the rows for training, as the question asks
X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.2)
Answer 11 (score: 2)
If you need to split your data with respect to the label column in your dataset, you can use this:
import pandas as pd
def split_to_train_test(df, label_column, train_frac=0.8, random_state=None):
    train_parts, test_parts = [], []
    for lbl in df[label_column].unique():
        lbl_df = df[df[label_column] == lbl]
        # sample train_frac of this label's rows; the remainder goes to test
        lbl_train_df = lbl_df.sample(frac=train_frac, random_state=random_state)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_parts.append(lbl_train_df)
        test_parts.append(lbl_test_df)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    return pd.concat(train_parts), pd.concat(test_parts)
and use it:
train, test = split_to_train_test(data, 'class', 0.7)
If you want to control the randomness of the split or use a global random seed, you can also pass random_state to DataFrame.sample.
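On recent pandas (1.1 or later, as far as I know) the same per-label sampling can be written with groupby(...).sample. A minimal sketch, assuming a made-up frame with a 'class' column and unique index labels:

```python
import pandas as pd

# Hypothetical data: two classes with 10 rows each.
df = pd.DataFrame({"value": range(20), "class": ["a"] * 10 + ["b"] * 10})

# Sample 70% of the rows within each class, then take the rest as the test set.
train = df.groupby("class").sample(frac=0.7, random_state=0)
test = df.drop(train.index)

print(train["class"].value_counts().to_dict())
```

This avoids the explicit loop, at the cost of requiring a newer pandas.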
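Answer 12 (score: 2)
You need to convert the pandas dataframe into a numpy array and then convert the numpy array back into a dataframe. (With current scikit-learn, train_test_split already returns DataFrames when given one, so the pd.DataFrame(...) calls below are redundant but harmless.)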
import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)
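Answer 13 (score: 2)
This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above (the random mask), but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).
import random as rnd
def make_sets(data_df, test_portion):
    tot_ix = range(len(data_df))
    # draw an exact number of row positions for the test set
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = sorted(set(tot_ix) - set(test_ix))
    # .ix was removed in pandas 1.0; these are positions, so use .iloc
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]
    return train_df, test_df
train_df, test_df = make_sets(data_df, 0.2)
test_df.head()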
Answer 14 (score: 1)
To split into more than two classes such as train, test, and validation, one can do:
probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validation_mask = probs >= 0.85
df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]
This will put approximately 70% of the data in training, 15% in test, and 15% in validation.
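If you need the three parts to have exact sizes rather than approximate ones, one option is to shuffle once and slice by position. A minimal sketch, assuming a made-up 100-row frame (the integer arithmetic for the cut points is my choice, to avoid float rounding):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})  # hypothetical data

# Shuffle all rows once, then cut at 70% and 85% of the row count.
shuffled = df.sample(frac=1, random_state=0)
n = len(shuffled)
train_end, test_end = n * 70 // 100, n * 85 // 100

df_training = shuffled.iloc[:train_end]
df_test = shuffled.iloc[train_end:test_end]
df_validation = shuffled.iloc[test_end:]

print(len(df_training), len(df_test), len(df_validation))
```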
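Answer 15 (score: 1)
Just select a range of rows from df like this (note that this takes the first fifth as test data, so it is only random if the rows are already shuffled):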
row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
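To make the positional slice random, the rows can be shuffled first. A minimal sketch, assuming a made-up 50-row frame; df.sample(frac=1) returns all rows in random order:

```python
import pandas as pd

df = pd.DataFrame({"x": range(50)})  # hypothetical data

# Shuffle every row, then slice off the first fifth as the test set.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
split_point = len(shuffled) // 5
test_data, train_data = shuffled[:split_point], shuffled[split_point:]
```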
Answer 16 (score: 1)
There are many ways to create train/test and even validation samples.
Case 1: the classic way, train_test_split, without any options:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
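Case 2: a very small dataset (<500 rows), where you want results for all of your rows via cross-validation. At the end, you will have one prediction for each row of your available training set.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
# shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    # X and y are assumed to be NumPy arrays; use .iloc[...] for pandas objects
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)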
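For what it's worth, scikit-learn's cross_val_predict wraps this kind of loop and returns the out-of-fold predictions in the original row order. A minimal sketch, assuming synthetic NumPy data and a smaller forest than above just to keep it quick:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical data: 60 rows, 3 features, target is the row sum.
rng = np.random.default_rng(0)
X = rng.random((60, 3))
y = X.sum(axis=1)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
reg = RandomForestRegressor(n_estimators=10, random_state=0)

# One out-of-fold prediction per row, aligned with the rows of X.
y_hat_all = cross_val_predict(reg, X, y, cv=kf)
```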
Case 3a: unbalanced datasets for classification purposes. Following case 1, here is the equivalent solution:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
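Case 3b: unbalanced datasets for classification purposes. Following case 2, here is the equivalent solution:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import StratifiedKFold
# shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    # X and y are assumed to be NumPy arrays; use .iloc[...] for pandas objects
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)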
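Case 4: you need to create train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).
from sklearn.model_selection import train_test_split
# hold out 40% first, then split that part half-and-half into test and validation
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y_test_val, test_size=0.5)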
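A quick sanity check of the 60/20/20 arithmetic: holding out 40% and then halving that part gives 20% each for test and validation. A minimal sketch on made-up balanced data (the arrays and seeds are only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, two balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First cut: 60% train, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
# Second cut: split the held-out part evenly into test and validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(X_train), len(X_test), len(X_val))
```

Note that the second call stratifies on the held-out labels, since the stratify array must have the same length as the data being split.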
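Answer 17 (score: 1)
You can use df.to_numpy() (df.as_matrix() in older pandas, removed in 1.0) to create a NumPy array and pass it.
from sklearn.model_selection import train_test_split
Y = df.pop('target')  # 'target' is a placeholder; pop() needs the name of your label column
X = df.to_numpy()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
# model is assumed to be a scikit-learn estimator you have already created
model.fit(x_train, y_train)
model.score(x_test, y_test)  # evaluate on the held-out data (estimators have no .test method)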
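Answer 18 (score: 1)
In my case, I wanted to split a data frame into train, test and dev sets with specific row counts. Here is my solution.
First, assign a unique id to each row of the dataframe (if one does not already exist):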
import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]
Here are my split numbers:
train = 120765
test = 4134
dev = 2816
The split function:
def df_split(df, n):
    first = df.sample(n)
    second = df[~df.id.isin(list(first['id']))]
    first.reset_index(drop=True, inplace = True)
    second.reset_index(drop=True, inplace = True)
    return first, second
Now split into train, test and dev:
train, test = df_split(df, 120765)
test, dev = df_split(test, 4134)
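If the index labels are already unique, the extra id column is not strictly needed, because the sampled rows can be dropped by index. A minimal sketch of that variant; df_split_by_count is a hypothetical helper name, and the row counts are scaled down for the example:

```python
import pandas as pd

def df_split_by_count(df, n, random_state=None):
    """Return (n sampled rows, remaining rows); assumes unique index labels."""
    first = df.sample(n=n, random_state=random_state)
    second = df.drop(first.index)
    return first.reset_index(drop=True), second.reset_index(drop=True)

df = pd.DataFrame({"x": range(100)})  # hypothetical data

# Peel off the training rows, then split the remainder into test and dev.
train, rest = df_split_by_count(df, 80, random_state=0)
test, dev = df_split_by_count(rest, 12, random_state=0)
```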
Answer 19 (score: 0)
How about this? df is my dataframe.
import math
total_size = len(df)
train_size = math.floor(0.66 * total_size)  # roughly 2/3 of my dataset
# training dataset
train = df.head(train_size)
# test dataset
test = df.tail(len(df) - train_size)
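Answer 20 (score: 0)
Something that feels a bit more elegant to me is to create a random column and then split by it; this way we get a split that suits our needs and is random.
def split_df(df, p=(0.8, 0.2)):
    import numpy as np
    # label each row with a part number drawn with probabilities p
    df["rand"] = np.random.choice(len(p), len(df), p=p)
    # iterate in order so the parts come back in the same order as p
    r = [df[df["rand"] == val] for val in range(len(p))]
    return r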
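Answer 21 (score: 0)
There are many great answers above, so I just want to add one more example for the case where you want to specify the exact number of samples for the train and test sets using only the numpy library.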
# set the random seed for the reproducibility
np.random.seed(17)
# e.g. number of samples for the training set is 1000
n_train = 1000
# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)
# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]
train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]
test_data = data_df.iloc[test_ids]
test_labels = labels_df.iloc[test_ids]
Answer 22 (score: 0)
I think you also need to take a copy rather than a slice of the dataframe if you want to add columns later.
msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
Answer 23 (score: 0)
If you want one dataframe in and two dataframes out (not numpy arrays), this should do the trick:
def split_data(df, train_perc = 0.8):
    df['train'] = np.random.rand(len(df)) < train_perc
    train = df[df.train == 1]
    test = df[df.train == 0]
    split_data ={'train': train, 'test': test}
    return split_data
Answer 24 (score: 0)
If you want to split it into train, test and validation sets, you can use this function:
from sklearn.model_selection import train_test_split
import pandas as pd
def train_test_val_split(df, test_size=0.15, val_size=0.45):
    temp, test = train_test_split(df, test_size=test_size)
    total_items_count = len(df.index)
    # val_size is a fraction of the full frame, so rescale it relative to what is left
    val_length = total_items_count * val_size
    new_val_proportion = val_length / len(temp.index)
    train, val = train_test_split(temp, test_size=new_val_proportion)
    return train, test, val