How to use dask_ml preprocessing on a dask distributed cluster

Time: 2021-07-09 15:19:20

Tags: dask dask-distributed dask-delayed dask-dataframe dask-ml

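How do I run dask_ml preprocessing on a dask distributed cluster? My dataset is about 200 GB, and every time I categorize it to prepare for OneHotEncoding, dask appears to ignore the client and tries to load the whole dataset into the local machine's memory. Maybe I'm missing something: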

from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import pandas as pd
import dask.dataframe as dd

# read_csv forwards keyword arguments to pandas, which expects `dtype`, not `dtypes`
df = dd.read_csv('s3://some-bucket/files*.csv', dtype={'column': 'category'})

pipe = make_pipeline(
    Categorizer(),
    DummyEncoder(),
    LogisticRegression(solver='lbfgs')
)

# y (the target) is not defined in the snippet as posted
pipe.fit(df, y)

0 answers:

No answers yet.