Question

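I am trying to split a data frame (about 188k rows) into training and test samples. The column 'FLAG' is my target variable and contains values of 0 or 1.

Since only about 1300 rows have 'FLAG' equal to 1, I want a stratified split so that both samples contain a representative share of 1 values.

I tried to do the split with sklearn's train_test_split function: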

train, test = train_test_split(df, test_size=0.2, stratify=df["FLAG"])

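My problem is that the resulting training and test samples have 177942 and 52 rows, respectively. I expected 150400 and 37600 rows.

From reading the documentation (sklearn.model_selection.train_test_split), my understanding is that I must pass the data frame, test_size, and the column containing the target classes ('FLAG' in my case).

Even a generic example: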

df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': np.random.rand(100000), 'c': 0})
df.loc[np.random.randint(0, 100000, 1000), 'c'] = 1
tr, ts = train_test_split(df, test_size=.2, stratify=df['c'])
print(tr.shape, ts.shape)

returns: (93105, 3) (38, 3)

My imports:

import cx_Oracle
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

My Python version: 3.7.0, sklearn version: 0.20.3, pandas version: 0.23.4

Answer 1

My investigation showed that this problem is caused by an integer overflow. It occurs only on 32-bit Python 3.7.x. The 64-bit version works fine.
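To find out whether a given interpreter is a 32-bit build, something like the following sketch may help. It relies only on the standard library; the helper name `interpreter_bits` is made up for illustration and is not part of sklearn or pandas.

```python
import struct
import sys


def interpreter_bits() -> int:
    """Return the pointer width of the running Python interpreter, in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes:
    # 4 on a 32-bit build, 8 on a 64-bit build.
    return struct.calcsize("P") * 8


bits = interpreter_bits()
# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
is_64bit = sys.maxsize > 2**32

print(f"Python {sys.version_info.major}.{sys.version_info.minor} is a {bits}-bit build")
```

If this reports a 32-bit build, the interpreter matches the configuration in which the wrong split sizes were observed.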

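In the end, I switched to 64-bit Python to solve the problem (I previously had to use the 32-bit version because of an unrelated dependency on Oracle packages).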
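Whichever interpreter is used, a quick sanity check on the split sizes can catch this kind of silent failure. The sketch below is only an illustration: the function `split_looks_sane` and its `tol` parameter are hypothetical, and the numbers are the ones reported in the question for the 100000-row example.

```python
def split_looks_sane(n_total, n_train, n_test, test_size, tol=0.01):
    """Return True if the split sizes add up and match the requested test fraction."""
    # Every row must land in exactly one of the two samples.
    if n_train + n_test != n_total:
        return False
    # The test share should be close to the requested test_size.
    return abs(n_test / n_total - test_size) <= tol


# Sizes reported in the question for the 100000-row example.
print(split_looks_sane(100_000, 93_105, 38, 0.2))      # False
# Sizes of a correct 80/20 split of the same data.
print(split_looks_sane(100_000, 80_000, 20_000, 0.2))  # True
```

In practice this could be called as split_looks_sane(len(df), len(tr), len(ts), 0.2) right after train_test_split.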

train_test_split with stratification
