Question

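I am trying to fit one ML model per partition of my dataset, and I do not know how to do it in Spark.

My dataset basically looks like this and is partitioned by company: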

Company | Features | Target

A         xxx        0.9
A         xxx        0.8
A         xxx        1.0
B         xxx        1.2
B         xxx        1.0
B         xxx        0.9
C         xxx        0.7
C         xxx        0.9
C         xxx        0.9

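My goal is to train one regressor per company in parallel (I have a few hundred million records and 100K companies). My intuition is that I need to use foreachPartition so that the partitions (i.e. my companies) are processed in parallel, with each company's model trained and saved. My main problem is how to deal with the iterator type that is passed to the function called by foreachPartition.

Here is what it would look like: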

dd.foreachPartition(

    iterator => {val company_df = iterator.toDF()  // the step I am stuck on: turning the iterator into a DataFrame
                 val rg = new RandomForestRegressor()
                                 .setLabelCol("target")
                                 .setFeaturesCol("features")
                                 .setNumTrees(10)
                 val model = rg.fit(company_df)
                 model.write.save(company_path)
                 }
)

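As I understand it, trying to convert the iterator to a dataframe is not possible, since the concept of an RDD cannot itself exist inside a foreachPartition statement.

I know the question is quite open, but I am really stuck.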

Answer 1

In pyspark, you can do the following:

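import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import pandas_udf, PandasUDFType

# df has four columns: id, y, x1, x2

group_column = 'id'
y_column = 'y'
x_columns = ['x1', 'x2']
# Output schema: the group key plus one fitted coefficient per feature column
schema = df.select(group_column, *x_columns).schema

# Input/output are both a pandas.DataFrame
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def ols(pdf):
    group_key = pdf[group_column].iloc[0]
    y = pdf[y_column]
    X = pdf[x_columns]
    X = sm.add_constant(X)
    model = sm.OLS(y, X).fit()

    # One output row per group: the key followed by the coefficients of x1 and x2
    return pd.DataFrame([[group_key] + [model.params[i] for i in x_columns]],
                        columns=[group_column] + x_columns)

# All rows of a group are handed to ols as a single pandas.DataFrame
beta = df.groupby(group_column).apply(ols)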
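
To make the pattern concrete without a cluster, here is a minimal pure-Python sketch of the same split-apply-combine idea: collect the rows of each group, fit a small model on that group alone, and return one result per group. It is only an illustration, not the Spark code path. The names fit_line and fit_per_group are hypothetical, the data is made up, and a closed-form single-feature least-squares fit stands in for statsmodels.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Closed-form least squares for y ~ intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Fall back to a flat line when all x values in the group are identical.
    slope = sxy / sxx if sxx else 0.0
    return y_bar - slope * x_bar, slope


def fit_per_group(rows):
    """rows: iterable of (group_key, x, y) tuples.

    Returns a dict mapping each group key to its (intercept, slope).
    """
    groups = defaultdict(lambda: ([], []))
    for key, x, y in rows:
        xs, ys = groups[key]
        xs.append(x)
        ys.append(y)
    # "Apply" step: one independent fit per group, like ols(pdf) above.
    return {key: fit_line(xs, ys) for key, (xs, ys) in groups.items()}


# Made-up data: group A follows y = 1 + 2x, group B follows y = 1 - 0.5x.
rows = [
    ("A", 1.0, 3.0), ("A", 2.0, 5.0), ("A", 3.0, 7.0),
    ("B", 1.0, 0.5), ("B", 2.0, 0.0), ("B", 3.0, -0.5),
]
models = fit_per_group(rows)
for key, (intercept, slope) in sorted(models.items()):
    print(key, round(intercept, 6), round(slope, 6))
```

The point of the sketch is that each group's fit only needs that group's rows in local memory, which is why the per-group function can be an ordinary single-machine routine while Spark handles the grouping and distribution.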
