Dask-ml loads all the data into RAM when it should be loading it chunk by chunk

Asked: 2018-03-07 11:15:32

Tags: python scikit-learn dask

I ran into a problem training a PartialSGDRegressor with dask-ml. When you train the model with the fit method, dask-ml internally uses scikit-learn's partial_fit, but it appears to load the entire dataset into memory rather than training incrementally on the chunks.

Why doesn't it train incrementally?

Memory report: (screenshot)

Example:

import dask.array as da
import dask.dataframe as dd
import dask_ml.linear_model
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler

# path comes from: 
# http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

path = '...... csv'

df = dd.read_csv(path, parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], blocksize=25e6)

xd = df[['trip_distance', 'passenger_count', 'PULocationID', 'DOLocationID']].to_delayed()
# Note: each i.compute() below pulls that chunk into memory just to read
# its shape and dtypes.
xfull = [da.from_delayed(i, i.compute().shape, i.compute().dtypes) for i in xd]
xf = da.concatenate(xfull)

yd = df['fare_amount'].to_delayed()
yfull = [da.from_delayed(i, i.compute().shape, i.compute().dtypes) for i in yd]
yf = da.concatenate(yfull)

with Profiler() as prof, ResourceProfiler(dt=1) as rprof, CacheProfiler() as cprof:
   est = dask_ml.linear_model.PartialSGDRegressor(
       loss='squared_loss', penalty='l1', alpha=0.1,
       learning_rate='constant', eta0=0.000001, max_iter=10)

   est.fit(xf,yf)

This is what happens when the CSV is 35.3 GB: (screenshot)

If I build the pipeline myself, it runs fine using less than 5 GB of RAM.
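For reference, the manual chunk-wise pipeline described here might look like the following minimal sketch. It uses scikit-learn's SGDRegressor.partial_fit directly with the same hyperparameters as the question; the synthetic chunk generator is purely illustrative, not the actual taxi-data loader:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

def chunks(n_chunks=10, rows=1000):
    """Synthetic stand-in for streaming (X, y) chunks off disk."""
    for _ in range(n_chunks):
        X = rng.normal(size=(rows, 4))  # 4 features, like the taxi columns
        y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=rows)
        yield X, y

est = SGDRegressor(penalty='l1', alpha=0.1,
                   learning_rate='constant', eta0=0.000001)

# Incremental training: only one chunk is in memory at a time.
for X, y in chunks():
    est.partial_fit(X, y)

print(est.coef_.shape)  # (4,)
```

Because each chunk is discarded after its partial_fit call, peak memory stays at roughly one chunk plus the model, which matches the small footprint reported above.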

0 Answers