I'm running into a problem where sklearn's train_test_split() produces an unexpected split on a large dataset. When I load the entire 118 MB dataset, the test set it returns is more than 10x smaller than what the code should produce.
Case 1: 60,000 data points
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv',nrows=60000)
data.shape
y = data['project_is_approved']
X = data.drop(['project_is_approved'], axis=1)
X.shape, y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output: (40200, 8) (40200,) (19800, 8) (19800,)
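For reference, when test_size is a float, train_test_split takes ceil(n_samples * test_size) rows for the test set, so the expected shapes can be checked with plain arithmetic, independently of the CSV (a quick sanity-check sketch, not part of the original code):

```python
import math

def expected_split(n_rows, test_size):
    # train_test_split uses ceil for the test set when test_size is a float
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return n_train, n_test

print(expected_split(60000, 0.33))   # (40200, 19800) -- matches Case 1
print(expected_split(109248, 0.30))  # (76473, 32775) -- what Case 2 should give
```

Case 1 matches this exactly, which suggests the split itself behaves normally on the first 60,000 rows.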
Case 2: 109,000 data points
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)
y = data['project_is_approved']
X = data.drop(['project_is_approved'], axis=1)
X.shape, y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=123)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output: (109248, 9) (90552, 8) (90552,) (1460, 8) (1460,)
As in Case 2, any dataset larger than 60K points suddenly splits into 90K and 1.4K. I've tried changing the random state, removing the random state, and moving the dataset to a new location, but the problem stays the same.