Looping over a variable in Dask with machine learning

Time: 2018-05-13 19:36:27

Tags: python pandas loops dask

Good afternoon.

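I've hit a small hiccup with Dask. I'm fairly new to Dask (in Python), and I'd like to run ML models in parallel using pure Dask (not dask-ml), one model per value of a variable named 'class'. I want to know what each model predicts as I vary feature1 (I know it is the most important feature in my data). I'm running the computation on a 4-node cluster and saving the predicted values for all classes in predicted_all.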

Here is an example of my pandas DataFrame df:

+-------+--------+-------------+-------------+---+-------------+
| class | target |  feature1   |  feature2   | … |  feature10  |
+-------+--------+-------------+-------------+---+-------------+
| A     |      5 | 97.19859896 | 816.6842211 | … | 0.54895439  |
| A     |      6 | 46.09606585 | 784.3270075 | … | 8.251889349 |
| A     |     43 | 17.65188263 | 549.5501609 | … | 13.50763389 |
| A     |      2 | 98.85817622 | 708.1968399 | … | 7.621150619 |
| A     |     56 | 88.01917025 | 613.0401243 | … | 6.000443628 |
| B     |      4 | 70.80513786 | 906.0185026 | … | 19.41657943 |
| B     |     78 | 93.80801173 | 891.289798  | … | 8.501099853 |
| B     |      7 | 46.63101139 | 483.0638367 | … | 3.875892614 |
| B     |      1 | 67.5788966  | 743.5923161 | … | 8.671806546 |
| B     |      0 | 90.90392867 | 109.8205978 | … | 17.70970394 |
| …     |        | …           | …           | … | …           |
| Z     |     89 | 58.424834   | 794.9165579 | … | 17.51302389 |
| Z     |    854 | 58.21094669 | 714.8873807 | … | 3.334251242 |
| Z     |     25 | 75.5155099  | 61.59911771 | … | 8.507249536 |
| Z     |     90 | 47.13722692 | 861.3884932 | … | 11.95500215 |
| Z     |     52 | 9.824626526 | 528.1958297 | … | 10.10468804 |
+-------+--------+-------------+-------------+---+-------------+

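df is a pandas DataFrame, and I want to run a loop over it so that Dask parallelizes the work across the nodes while I keep working with pandas DataFrames.

Following the Dask tutorial, I did the following: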

import dask.multiprocessing
import pandas as pd
import numpy as np
from dask import compute, delayed
from sklearn.linear_model import LinearRegression

classes = ['ABC...Z']
predicted_all = pd.DataFrame()

def linearmodel(cls):  # apply linear regression to the rows of one class
    # `class` is a reserved word in Python, so the parameter is named `cls`

    global predicted_all
    df_oneClass = df[df['class'] == cls].drop(['class'], axis=1)
    df_y = pd.DataFrame(df_oneClass['target'])
    df_X = df_oneClass.drop(['target'], axis=1)

    model = LinearRegression()
    model.fit(df_X, df_y)

    X_predict = []  # range of feature1 values I'd like to know the forecast outcome for
    y_predict = model.predict(X_predict)
    # model.predict returns a NumPy array, so wrap it before concatenating
    predicted_all = pd.concat([predicted_all, pd.DataFrame(y_predict)], axis=0)

results = [delayed(linearmodel)(cls) for cls in classes]
resultsDask = compute(*results, get=dask.multiprocessing.get)

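Following MRocklin's suggestion, I rewrote the code to focus on getting the mean absolute error for each class and returning it as a tuple. My function now looks like this: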

from sklearn.metrics import mean_absolute_error

def linearmodel(cls):  # apply linear regression to the rows of one class

    df_oneClass = df[df['class'] == cls].drop(['class'], axis=1)
    df_y = pd.DataFrame(df_oneClass['target'])
    df_X = df_oneClass.drop(['target'], axis=1)

    model = LinearRegression()
    model.fit(df_X, df_y)

    # in-sample predictions, so they line up row for row with df_y
    y_predict = model.predict(df_X)
    mae = mean_absolute_error(df_y, y_predict)

    return (cls, mae)

results = [delayed(linearmodel)(cls) for cls in classes]
resultsDask = compute(*results, get=dask.multiprocessing.get)

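Unfortunately, the work is not parallelized and predicted_all is empty. Any clues as to why?

Thank you for taking the time to read this; any help or pointers are very welcome.

Best regards,

Christian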

1 Answer:

Answer 0: (score: 0)

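I recommend not using global state and instead using functions that return their values directly. With the multiprocessing scheduler each task runs in a separate worker process with its own copy of the module's globals, so anything a task appends to predicted_all never reaches the process that called compute, and predicted_all stays empty there.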
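As a minimal sketch of that pattern (not necessarily how your real pipeline should look), each task below returns a (label, MAE) pair and the results are assembled only after compute finishes. The names fit_one_class, fit_all_classes and demo_df are made up for this illustration, the data is synthetic, and I'm assuming a Dask version where compute accepts the scheduler= keyword (which, as far as I know, replaced the older get= argument).

```python
import numpy as np
import pandas as pd
from dask import compute, delayed
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def fit_one_class(group, label):
    """Fit a linear model on one class and return (label, MAE); no globals are touched."""
    y = group["target"]
    X = group.drop(columns=["class", "target"])
    model = LinearRegression().fit(X, y)
    # In-sample error: predictions are made on the same rows used for fitting.
    mae = mean_absolute_error(y, model.predict(X))
    return label, mae


def fit_all_classes(df, scheduler="threads"):
    """Build one delayed task per class, compute them together, and collect the results."""
    # Split once up front so each task receives only the rows of its own class.
    tasks = [delayed(fit_one_class)(group, label) for label, group in df.groupby("class")]
    results = compute(*tasks, scheduler=scheduler)
    # compute returns a tuple of (label, mae) pairs in task order.
    return pd.DataFrame(list(results), columns=["class", "mae"])


# Synthetic stand-in for the question's df (hypothetical data, two features only).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "class": np.repeat(["A", "B", "Z"], 20),
    "feature1": rng.uniform(0, 100, 60),
    "feature2": rng.uniform(0, 1000, 60),
})
demo_df["target"] = (
    2.0 * demo_df["feature1"] + 0.5 * demo_df["feature2"] + rng.normal(0, 1, 60)
)

summary = fit_all_classes(demo_df)
print(summary)
```

Passing scheduler="processes" should give process-based parallelism similar to dask.multiprocessing.get; in a script that call is best placed under an `if __name__ == "__main__":` guard. If you want the forecasts rather than the error, the same shape applies: have fit_one_class return a small DataFrame of predictions and call pd.concat on the computed results once, in the calling process.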