Dask-ml loads all the data into RAM when it should be loading it chunk by chunk

Asked: 2018-03-07 11:15:32

Tags: python scikit-learn dask

I ran into a problem training a PartialSGDRegressor with dask-ml. When you train the model with the fit method, dask-ml internally uses scikit-learn's partial_fit, but it appears to load the entire dataset into memory rather than training incrementally on the chunks.

Why doesn't it train incrementally?

Memory report: (screenshot)

Example:

import dask.array as da
import dask.dataframe as dd
import dask_ml.linear_model
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler

# path comes from: 
# http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

path = '...... csv'

df = dd.read_csv(path, parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], blocksize=25e6)

xd = df[['trip_distance', 'passenger_count', 'PULocationID', 'DOLocationID']].to_delayed()
# Note: each i.compute() below pulls that chunk into memory just to read
# its shape and dtypes.
xfull = [da.from_delayed(i, i.compute().shape, i.compute().dtypes) for i in xd]
xf = da.concatenate(xfull)

yd = df['fare_amount'].to_delayed()
yfull = [da.from_delayed(i, i.compute().shape, i.compute().dtypes) for i in yd]
yf = da.concatenate(yfull)

with Profiler() as prof, ResourceProfiler(dt=1) as rprof, CacheProfiler() as cprof:
   est = dask_ml.linear_model.PartialSGDRegressor(
       loss='squared_loss', penalty='l1', alpha=0.1,
       learning_rate='constant', eta0=0.000001, max_iter=10)

   est.fit(xf,yf)

This is what happens when the CSV is 35.3 GB: (screenshot)

If I build the pipeline myself, it runs fine using less than 5 GB of RAM.
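For reference, the manual chunk-wise pipeline described here might look like the following minimal sketch. It uses scikit-learn's SGDRegressor.partial_fit directly with the same hyperparameters as the question; the synthetic chunk generator is purely illustrative, not the actual taxi-data loader:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

def chunks(n_chunks=10, rows=1000):
    """Synthetic stand-in for streaming (X, y) chunks off disk."""
    for _ in range(n_chunks):
        X = rng.normal(size=(rows, 4))  # 4 features, like the taxi columns
        y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=rows)
        yield X, y

est = SGDRegressor(penalty='l1', alpha=0.1,
                   learning_rate='constant', eta0=0.000001)

# Incremental training: only one chunk is in memory at a time.
for X, y in chunks():
    est.partial_fit(X, y)

print(est.coef_.shape)  # (4,)
```

Because each chunk is discarded after its partial_fit call, peak memory stays at roughly one chunk plus the model, which matches the small footprint reported above.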

0 Answers