scikit-learn train_test_split() splitting the data unexpectedly

Date: 2020-07-08 18:01:05

Tags: python scikit-learn python-3.6 python-3.7 train-test-split

I am running into an issue where sklearn's train_test_split() splits the dataset unexpectedly once the dataset gets large. When I load the entire 118 MB dataset, the test split it allocates is roughly 10x smaller than what the code should produce.
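For reference, on synthetic data of the same shape the split behaves exactly as documented, so I don't think I'm misusing the API (a minimal sketch; the shapes and class balance are made up to resemble my data):

import numpy as np
from sklearn.model_selection import train_test_split

# clean, synthetic stand-in for my data: 100,000 rows, 8 feature columns
X = np.zeros((100000, 8))
y = np.random.randint(0, 2, size=100000)  # binary label, like project_is_approved

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)
print(X_tr.shape, X_te.shape)  # (70000, 8) (30000, 8) -- exactly 30% test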

Case 1: 60,000 data points

# loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv', nrows=60000)
print(data.shape)
y = data['project_is_approved']
X = data.drop(['project_is_approved'], axis=1)
print(X.shape, y.shape)

# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

Output:

(40200, 8) (40200,)
(19800, 8) (19800,)
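This is exactly what I expect: 0.33 × 60,000 = 19,800 test rows, leaving 40,200 for training, and the two splits sum back to the full 60,000 rows.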

Case 2: 109,000 data points

# loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)
y = data['project_is_approved']
X = data.drop(['project_is_approved'], axis=1)
print(X.shape, y.shape)

# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=123)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

Output:

(109248, 9)
(90552, 8) (90552,)
(1460, 8) (1460,)
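With test_size=0.3 on 109,248 rows I would expect roughly 76,470 training and 32,770 test rows. Instead the test set has only 1,460 rows, and 90,552 + 1,460 = 92,012, which is about 17,000 rows fewer than the DataFrame printed above, so rows seem to disappear before or during the split.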

Any run past 60K data points suddenly produces the 90K/1.4K split shown in Case 2. I have tried changing random_state, removing random_state, and moving the dataset to a new location, but the problem stays the same.
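To narrow this down I put together the following sanity check (a sketch under my assumptions: the same preprocessed_data.csv, with the label column named as in the code above):

import pandas

# Compare the physical line count of the file with the rows pandas parses;
# a mismatch would point at malformed rows (unquoted commas/newlines),
# not at train_test_split itself.
with open('preprocessed_data.csv') as f:
    n_lines = sum(1 for _ in f) - 1  # subtract the header line

data = pandas.read_csv('preprocessed_data.csv')
print('lines in file:', n_lines, '| rows parsed:', data.shape[0])

# stratify expects a small set of clean class labels; NaNs or values
# leaking in from shifted columns would show up here immediately.
print(data['project_is_approved'].value_counts(dropna=False))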

0 Answers