如何在 dask 分布式集群中进行 dask_ml 预处理?我的数据集大约 200GB,每次我对准备 OneHotEncoding 的数据集进行分类时,看起来 dask 都在忽略客户端并尝试将数据集加载到本地机器的内存中。也许我错过了什么:
from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import pandas as pd
import dask.dataframe as dd
df = dd.read_csv('s3://some-bucket/files*.csv', dtypes={'column': 'category'})
pipe = make_pipeline(
Categorizer(),
DummyEncoder(),
LogisticRegression(solver='lbfgs')
)
pipe.fit(df, y)