Question

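I am writing some code that uses the pyspark isotonic regression package to isotonically smooth many curves in a DataFrame.

I have written a nasty, slow loop that does what I want, but I am now at a stage where it is a massive bottleneck and needs improving. The code is as follows: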

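from pyspark.ml.regression import IsotonicRegression

# Sample data: (category_1, category_2, markdown, uplift)
l = [('x', 'a', 1, 2.1),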
     ('x', 'a', 2, 3.0),
     ('x', 'a', 3, 3.6),
     ('x', 'a', 4, 3.4),
     ('x', 'a', 5, 4.0),
     ('x', 'a', 6, 5.0),
     ('x', 'b', 1, 1.1),
     ('x', 'b', 2, 1.5),
     ('x', 'b', 3, 1.4),
     ('x', 'b', 4, 1.6),
     ('x', 'b', 5, 1.8),
     ('x', 'b', 6, 2.0),
     ('y', 'a', 1, 10.0),
     ('y', 'a', 2, 10.1),
     ('y', 'a', 3, 11.2),
     ('y', 'a', 4, 10.2),
     ('y', 'a', 5, 12.0),
     ('y', 'a', 6, 13.0)]
df = sqlContext.createDataFrame(l, ['category_1', 'category_2', 'markdown', 'uplift'])
# The feature column is cast to double before fitting
df = df.withColumn('markdown', df["markdown"].cast('double'))

isomodel = IsotonicRegression(featuresCol='markdown', labelCol='uplift', predictionCol='smoothed_uplift')
# Collect the distinct (category_1, category_2) pairs to the driver
unique_refs = df.select('category_1', 'category_2').distinct()
unique_refs = unique_refs.toPandas()

for row in range(len(unique_refs)):
    cat_1 = unique_refs.loc[row, 'category_1']
    cat_2 = unique_refs.loc[row, 'category_2']
    print("processing ", cat_1, cat_2)
    # Keep only the rows belonging to this (category_1, category_2) pair
    temp_df = df.filter((df['category_1'] == cat_1) &
                        (df['category_2'] == cat_2))
    temp_df.show()  # show() prints the table itself and returns None
    # Fit one isotonic model per pair, then stack the smoothed results
    ir_model = isomodel.fit(temp_df)
    if row == 0:
        smoothed_df = ir_model.transform(temp_df)
    else:
        smoothed_df_temp = ir_model.transform(temp_df)
        smoothed_df = smoothed_df.union(smoothed_df_temp)

The result I get (and still want to get) is as follows:

+----------+----------+--------+------+---------------+
|category_1|category_2|markdown|uplift|smoothed_uplift|
+----------+----------+--------+------+---------------+
|         x|         b|     1.0|   1.1|            1.1|
|         x|         b|     2.0|   1.5|           1.45|
|         x|         b|     3.0|   1.4|           1.45|
|         x|         b|     4.0|   1.6|            1.6|
|         x|         b|     5.0|   1.8|            1.8|
|         x|         b|     6.0|   2.0|            2.0|
|         y|         a|     1.0|  10.0|           10.0|
|         y|         a|     2.0|  10.1|           10.1|
|         y|         a|     3.0|  11.2|           10.7|
|         y|         a|     4.0|  10.2|           10.7|
|         y|         a|     5.0|  12.0|           12.0|
|         y|         a|     6.0|  13.0|           13.0|
|         x|         a|     1.0|   2.1|            2.1|
|         x|         a|     2.0|   3.0|            3.0|
|         x|         a|     3.0|   3.6|            3.5|
|         x|         a|     4.0|   3.4|            3.5|
|         x|         a|     5.0|   4.0|            4.0|
|         x|         a|     6.0|   5.0|            5.0|
+----------+----------+--------+------+---------------+

I know there must be a better way to do this, but I don't know where to start!
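For reference, the per-group computation can be sketched in plain Python without Spark. This is only a minimal sketch under stated assumptions: it uses the pool-adjacent-violators approach with unit weights and assumes each `markdown` value appears once per group, so it may not match the Spark implementation in other cases. The helper names `pava` and `smooth_by_group` are hypothetical. A function of this shape could in principle be applied once per group through a grouped-map API such as `GroupedData.applyInPandas` (PySpark 3.0+) instead of looping on the driver, but that wiring is not shown or tested here.

```python
from collections import defaultdict


def pava(ys):
    """Least-squares non-decreasing fit of ys (unit weights), via pool adjacent violators."""
    blocks = []  # each block is [sum_of_values, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)  # every point in a block gets the block mean
    return fitted


def smooth_by_group(rows):
    """rows: iterable of (category_1, category_2, markdown, uplift) tuples.

    Returns the same rows with a fifth field, the smoothed uplift,
    fitted independently for each (category_1, category_2) pair.
    """
    groups = defaultdict(list)
    for cat_1, cat_2, markdown, uplift in rows:
        groups[(cat_1, cat_2)].append((markdown, uplift))

    out = []
    for (cat_1, cat_2), points in groups.items():
        points.sort()  # order by markdown (assumed distinct within a group)
        fitted = pava([uplift for _, uplift in points])
        for (markdown, uplift), smoothed in zip(points, fitted):
            out.append((cat_1, cat_2, markdown, uplift, smoothed))
    return out


# Usage on one of the groups from the question
sample = [('x', 'b', 1.0, 1.1), ('x', 'b', 2.0, 1.5), ('x', 'b', 3.0, 1.4),
          ('x', 'b', 4.0, 1.6), ('x', 'b', 5.0, 1.8), ('x', 'b', 6.0, 2.0)]
for record in smooth_by_group(sample):
    print(record)
```

Because each group is fitted independently, the groups carry no shared state, which is what makes the problem a candidate for parallel execution.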

Training and applying multiple isotonic regressions in parallel

0 answers: