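Good afternoon.
I've hit a minor hiccup with Dask. I'm fairly new to Dask (in Python), and I'd like to use plain Dask (not dask-ml) to run ML models in parallel, one per value of a variable named 'class'. I want to know what the model predicts as I vary feature1 (in my data, I know it is the most important feature). I'm running the computation on a 4-node cluster and saving the predicted values for all classes in predicted_all.
Here is an example of my pandas data, df: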
+-------+--------+-------------+-------------+---+-------------+
| class | target | feature1 | feature2 | … | feature10 |
+-------+--------+-------------+-------------+---+-------------+
| A | 5 | 97.19859896 | 816.6842211 | … | 0.54895439 |
| A | 6 | 46.09606585 | 784.3270075 | … | 8.251889349 |
| A | 43 | 17.65188263 | 549.5501609 | … | 13.50763389 |
| A | 2 | 98.85817622 | 708.1968399 | … | 7.621150619 |
| A | 56 | 88.01917025 | 613.0401243 | … | 6.000443628 |
| B | 4 | 70.80513786 | 906.0185026 | … | 19.41657943 |
| B | 78 | 93.80801173 | 891.289798 | … | 8.501099853 |
| B | 7 | 46.63101139 | 483.0638367 | … | 3.875892614 |
| B | 1 | 67.5788966 | 743.5923161 | … | 8.671806546 |
| B | 0 | 90.90392867 | 109.8205978 | … | 17.70970394 |
| … | | … | … | … | … |
| Z | 89 | 58.424834 | 794.9165579 | … | 17.51302389 |
| Z | 854 | 58.21094669 | 714.8873807 | … | 3.334251242 |
| Z | 25 | 75.5155099 | 61.59911771 | … | 8.507249536 |
| Z | 90 | 47.13722692 | 861.3884932 | … | 11.95500215 |
| Z | 52 | 9.824626526 | 528.1958297 | … | 10.10468804 |
+-------+--------+-------------+-------------+---+-------------+
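df is a pandas DataFrame. I'd like to loop over it so that Dask parallelizes the work across the nodes while each task still operates on a pandas DataFrame.
Following the Dask tutorial, I did the following: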
import pandas as pd
import numpy as np
from dask import delayed, compute
from sklearn.linear_model import LinearRegression

classes = ['ABC...Z']
predicted_all = pd.DataFrame()

def linearmodel(cls):  # apply linear regression to one class ('class' is a reserved word in Python)
    global predicted_all
    df_oneClass = df[df['class'] == cls].drop(['class'], axis=1)
    df_y = pd.DataFrame(df_oneClass['target'])
    df_X = df_oneClass.drop(['target'], axis=1)
    model = LinearRegression()
    model.fit(df_X, df_y)
    X_predict = []  # range of feature1 values I'd like to know the forecast outcome for
    y_predict = model.predict(X_predict)
    predicted_all = pd.concat([predicted_all, pd.DataFrame(y_predict)], axis=0)

results = [delayed(linearmodel)(cls) for cls in classes]
# Older Dask releases spelled this get=dask.multiprocessing.get
resultsDask = compute(*results, scheduler='processes')
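Following MRocklin's suggestion, I rewrote the function so that it computes the mean absolute error for each class and returns it in a tuple. My function now looks like this: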
from sklearn.metrics import mean_absolute_error

def linearmodel(cls):  # apply linear regression to one class
    df_oneClass = df[df['class'] == cls].drop(['class'], axis=1)
    df_y = pd.DataFrame(df_oneClass['target'])
    df_X = df_oneClass.drop(['target'], axis=1)
    model = LinearRegression()
    model.fit(df_X, df_y)
    y_predict = model.predict(df_X)  # predict on the training rows so the lengths match df_y
    mae = mean_absolute_error(df_y, y_predict)
    return (cls, mae)

results = [delayed(linearmodel)(cls) for cls in classes]
resultsDask = compute(*results, scheduler='processes')
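Unfortunately, the work is not parallelized and predicted_all is empty. Any clues as to why?
Thank you for taking the time to read this; any help or pointers are very welcome.
Regards,
Christian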
Answer 0 (score: 0)
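I recommend not using global state and instead using functions that return values directly. With the multiprocessing scheduler each task runs in a separate worker process, so an assignment to a global such as predicted_all changes only that worker's copy and never reaches the process that called compute.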
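As a rough illustration of that advice, here is a minimal sketch in which each delayed task returns its own predictions and the caller concatenates them afterwards. The helper names (predict_one_class, predict_all_classes), the synthetic demo_df, and the X_grid of feature1 values are all made up for this example and are not from the question. It uses the threaded scheduler so it runs anywhere; passing scheduler="processes" should also work on a single machine, while a multi-node cluster would need dask.distributed, which is not shown here.

```python
import numpy as np
import pandas as pd
from dask import delayed, compute
from sklearn.linear_model import LinearRegression


def predict_one_class(df_one_class, X_predict):
    """Fit a linear model on one class's rows and return its predictions."""
    y = df_one_class["target"]
    X = df_one_class.drop(columns=["class", "target"])
    model = LinearRegression().fit(X, y)
    # Return the result instead of writing to a global variable.
    return pd.DataFrame({
        "class": df_one_class["class"].iloc[0],
        "prediction": model.predict(X_predict),
    })


def predict_all_classes(df, X_predict, scheduler="threads"):
    """Build one delayed task per class, run them, and concatenate the results."""
    tasks = [
        delayed(predict_one_class)(group, X_predict)
        for _, group in df.groupby("class")
    ]
    results = compute(*tasks, scheduler=scheduler)
    return pd.concat(results, ignore_index=True)


# Synthetic stand-in data (invented for illustration, not the original df).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "class": np.repeat(["A", "B", "Z"], 20),
    "feature1": rng.uniform(0, 100, 60),
    "feature2": rng.uniform(0, 1000, 60),
})
demo_df["target"] = 2.0 * demo_df["feature1"] + 0.5 * demo_df["feature2"]

# Hypothetical grid: vary feature1 while holding feature2 fixed.
X_grid = pd.DataFrame({"feature1": [10.0, 50.0, 90.0], "feature2": 500.0})

predicted_all = predict_all_classes(demo_df, X_grid)
print(predicted_all)
```

Because every task hands its DataFrame back through compute, predicted_all is assembled once in the calling process, and the result does not depend on which worker ran which class.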