train_test_split: cannot operate on Dask array with unknown chunk sizes

Time: 2019-03-31 15:36:48

Tags: python dask dask-ml

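I have a text classification dataset that I stored as Parquet with Dask to save disk space. Now I run into a problem when I try to split it into training and test sets with dask_ml.model_selection.train_test_split.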

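import dask.dataframe as dd
from dask_ml.model_selection import train_test_split

ddf = dd.read_parquet('/storage/data/cleaned')
y = ddf['category'].values
X = ddf.drop('category', axis=1).values
# Passing X and y returns four arrays; this call raises the error shown below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)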

This results in: TypeError: Cannot operate on Dask array with unknown chunk sizes.

Thanks for your help.

1 answer:

Answer 0 (score: 0)

This is the workaround I am using for now:

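import dask.dataframe as dd
from dask_ml.model_selection import train_test_split
ddf = dd.read_parquet('/storage/data/cleaned')
# lengths=True computes each partition's length, so the chunk sizes are known
arr = ddf.to_dask_array(lengths=True)
train, test = train_test_split(arr, test_size=0.2)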

This produces a Dask array with known chunk sizes, of the form dask.array<array, shape=(3937987, 2), dtype=object, chunksize=(49701, 2)>.