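I ran into a problem when training PartialSGDRegressor with dask-ml. When I train the model with the fit method, dask-ml calls sklearn's own partial_fit internally, but it appears to load the entire dataset into memory instead of using the blocks to train incrementally.
Why doesn't it train incrementally?
Example: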
import dask.array as da
import dask.dataframe as dd
import dask_ml.linear_model
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler
# path comes from:
# http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
path = '...... csv'
# Read the CSV lazily in blocks of roughly 25 MB.
df = dd.read_csv(path, parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], blocksize=25e6)
# Features: one dask array chunk per dataframe partition.
# Note: i.compute() materializes each partition to look up its shape and dtype.
xd = df[['trip_distance', 'passenger_count', 'PULocationID', 'DOLocationID']].to_delayed()
xfull = [da.from_delayed(i, i.compute().shape, i.compute().dtypes) for i in xd]
xf = da.concatenate(xfull)
# Target: built the same way from the fare column.
yd = df['fare_amount'].to_delayed()
yfull = [da.from_delayed(i, i.compute().shape, i.compute().dtypes) for i in yd]
yf = da.concatenate(yfull)
# Profile task timing, CPU/memory usage, and cache behaviour during training.
with Profiler() as prof, ResourceProfiler(dt=1) as rprof, CacheProfiler() as cprof:
    est = dask_ml.linear_model.PartialSGDRegressor(
        loss='squared_loss', penalty='l1', alpha=0.1,
        learning_rate='constant', eta0=0.000001, max_iter=10)
    est.fit(xf, yf)
If I build the pipeline myself, it works fine and uses less than 5 GB of RAM.
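To make "build the pipeline myself" concrete, here is a minimal sketch of that kind of hand-rolled incremental loop. It is not the exact pipeline I used: it assumes scikit-learn's SGDRegressor.partial_fit and feeds it synthetic blocks instead of the taxi CSV. The helpers iter_blocks and train_incrementally are hypothetical names for illustration, not part of dask-ml or sklearn, and the hyperparameters are chosen for the toy data. The loss argument is left at its default (squared error) because its string name differs between sklearn versions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


def iter_blocks(n_blocks, rows_per_block, n_features, seed=0):
    """Yield (X, y) one block at a time; stands in for reading CSV chunks."""
    rng = np.random.default_rng(seed)
    true_coef = np.arange(1, n_features + 1, dtype=float)  # 1, 2, 3, ...
    for _ in range(n_blocks):
        X = rng.normal(size=(rows_per_block, n_features))
        y = X @ true_coef + rng.normal(scale=0.1, size=rows_per_block)
        yield X, y


def train_incrementally(blocks, **sgd_kwargs):
    """Call partial_fit once per block so only one block is in memory."""
    model = SGDRegressor(**sgd_kwargs)
    n_seen = 0
    for X, y in blocks:
        model.partial_fit(X, y)  # updates the existing weights in place
        n_seen += X.shape[0]
    return model, n_seen


model, n_seen = train_incrementally(
    iter_blocks(n_blocks=20, rows_per_block=500, n_features=4),
    penalty="l1", alpha=1e-4, learning_rate="constant", eta0=0.01,
    random_state=0,
)
print(n_seen, model.coef_)
```

Because the blocks come from a generator, each one can be garbage-collected after its partial_fit call, so peak memory is bounded by the block size rather than the dataset size.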