Can I use a weighted moving average in PySpark for forecasting?
Suppose I have a DataFrame like this:
+--------+---------------+------------+--------+-----------+---------------+-----------+
| MKT| MANUFACTURER| FRANCHISE| BRAND| SUBBRAND| Sales|date_format|
+--------+---------------+------------+--------+-----------+---------------+-----------+
|Market A| Competitor A| FR AA| BR K| SBR AR| 0.0| 2015-09-19|
|Market A| Competitor A| FR AA| BR K| SBR AS| 0.0| 2016-06-25|
|Market A| Competitor A| FR AA| BR K| SBR AT| 0.0| 2015-10-24|
|Market A| Competitor A| FR AA| BR K| SBR AT| 0.0| 2015-11-28|
|Market A| Competitor A| FR Y| BR H| SBR AD| 0.0| 2015-02-21|
|Market A| Competitor A| FR Y| BR H| SBR AE| 0.0| 2016-08-13|
|Market A| Competitor A| FR Y| BR H| SBR AG| 0.0| 2015-02-07|
|Market A| Competitor A| FR Y| BR H| SBR Y| 0.0| 2016-06-04|
|Market A| Competitor A| FR Y| BR I| SBR AI| 0.0| 2016-07-16|
|Market A| Competitor A| FR Y| BR I| SBR AP| 0.0| 2016-04-30|
|Market A| Competitor B| FR AB| BR L| SBR AU| 20677.0| 2016-03-05|
|Market A| Competitor B| FR AB| BR N| SBR AY| 0.0| 2015-11-07|
|Market A| Competitor E| FR AF| BR R| SBR BC| 5834.0| 2016-07-09|
|Market A| Competitor E| FR AF| BR S| SBR BD| 1664.0| 2015-06-20|
|Market A| My Company| FR W| BR D| SBR H| 0.0| 2016-02-27|
|Market A| My Company| FR W| BR E| SBR K| 10355.0| 2015-09-19|
|Market A| My Company| FR W| BR E| SBR O| 0.0| 2015-12-26|
|Market A| My Company| FR W| BR E| SBR S| 0.0| 2016-01-23|
|Market A| My Company| FR W| BR E| SBR T| 0.0| 2015-09-19|
|Market A| My Company| FR X| BR G| SBR V| 0.0| 2015-02-07|
+--------+---------------+------------+--------+-----------+---------------+-----------+
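If my last date is 2016-08-13
and I want to forecast the next four data points, how should I go about it?
I have not been able to add more rows with the new dates.
I want to predict the next 4 points, so for now I take a rolling average of Sales
in another column and then extrapolate from that new column, which is the wrong approach.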
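To make the goal concrete, here is a minimal plain-Python sketch (not PySpark) of a recursive weighted moving average forecast for a single series. The function names `forecast_wma` and `next_weekly_dates`, the toy history, and the weights are all illustrative assumptions, not part of my actual code. The idea is that each forecast is appended to the series so the next forecast can use it, and the new dates are generated separately from the last observed date.

```python
from datetime import date, timedelta


def forecast_wma(history, weights, steps):
    """Recursively forecast `steps` points with a weighted moving average.

    weights[0] applies to the most recent value, weights[1] to the one
    before it, and so on. Each forecast is appended to the working series
    so that later forecasts can build on it.
    """
    series = list(history)
    forecasts = []
    for _ in range(steps):
        recent = series[::-1][:len(weights)]  # newest value first
        pred = sum(w * v for w, v in zip(weights, recent))
        series.append(pred)
        forecasts.append(pred)
    return forecasts


def next_weekly_dates(last_date, steps):
    """Return the `steps` weekly dates that follow `last_date`."""
    return [last_date + timedelta(weeks=k) for k in range(1, steps + 1)]


# Toy inputs (assumptions for illustration only)
history = [10.0, 20.0, 30.0]
weights = [0.5, 0.3, 0.2]  # sums to 1, newest value weighted most

dates = next_weekly_dates(date(2016, 8, 13), 4)
for d, p in zip(dates, forecast_wma(history, weights, 4)):
    print(d.isoformat(), round(p, 3))
```

In this sketch, if the history is shorter than the weight list, the weights are simply truncated and not renormalized, so the first forecasts would be biased low. A real implementation would probably need to handle that case, and would have to run once per MKT/MANUFACTURER/FRANCHISE/BRAND/SUBBRAND group.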
My weighted moving average function (source):
from functools import reduce
from operator import add

from pyspark.sql import functions as F


def weighted_average(c, window, offsets, weights):
    assert len(weights) == len(offsets)

    def value(i):
        # Negative offsets look back (lag), positive offsets look ahead (lead)
        if i < 0:
            return F.lag(c, -i).over(window)
        if i > 0:
            return F.lead(c, i).over(window)
        return c

    # Create a list of Columns
    # - `value_i * weight_i` if `value_i IS NOT NULL`
    # - literal 0 otherwise
    values = [F.coalesce(value(i) * w, F.lit(0)) for i, w in zip(offsets, weights)]

    # or sum(values, lit(0))
    return reduce(add, values, F.lit(0))
How I use it:
og_df = original_data.withColumn('week_1_pred', weighted_average(F.col('Sales'), w, offsets, delays))
og_df = og_df.withColumn('week_2_pred', weighted_average(F.col('week_1_pred'), w, offsets, delays))
og_df = og_df.withColumn('week_3_pred', weighted_average(F.col('week_2_pred'), w, offsets, delays))
og_df = og_df.withColumn('week_4_pred', weighted_average(F.col('week_3_pred'), w, offsets, delays))
The resulting DataFrame is:
+--------+---------------+------------+--------+-----------+---------------+-----------+------------------+------------------+------------------+------------------+
| MKT| MANUFACTURER| FRANCHISE| BRAND| SUBBRAND| Sales|date_format| week_1_pred| week_2_pred| week_3_pred| week_4_pred|
+--------+---------------+------------+--------+-----------+---------------+-----------+------------------+------------------+------------------+------------------+
|Market C| Competitor E| FR AF| BR R| SBR BC| 28431.0| 2015-01-10| 0.0| 0.0| 0.0| 0.0|
|Market C| Competitor E| FR AF| BR R| SBR BC| 28988.0| 2015-01-17| 8529.3| 0.0| 0.0| 0.0|
|Market C| Competitor E| FR AF| BR R| SBR BC| 34777.0| 2015-01-24|17225.699999999997|2558.7899999999995| 0.0| 0.0|
|Market C| Competitor E| FR AF| BR R| SBR BC| 36580.0| 2015-01-31| 24815.7| 7726.499999999998| 767.6369999999998| 0.0|
|Market C| Competitor E| FR AF| BR R| SBR BC| 42142.0| 2015-02-07|30047.799999999996|14318.279999999999| 3085.586999999999|230.29109999999994|
|Market C| Competitor E| FR AF| BR R| SBR BC| 44354.0| 2015-02-14|34892.350000000006| 20757.12| 7125.191999999999|1155.9671999999996|
|Market C| Competitor E| FR AF| BR R| SBR BC| 46517.0| 2015-02-21|39613.450000000004|26594.219999999998|12323.798999999999|3216.7610999999993|
|Market C| Competitor E| FR AF| BR R| SBR BC| 45123.0| 2015-02-28| 42535.95|32130.620000000003| 17969.6475| 6528.5784|
|Market C| Competitor E| FR AF| BR R| SBR BC| 45031.0| 2015-03-07| 44144.85| 36730.14| 23714.9685|10860.012899999998|
|Market C| Competitor E| FR AF| BR R| SBR BC| 42457.0| 2015-03-14| 44721.1|40159.340000000004| 29155.023| 15875.325|
|Market C| Competitor E| FR AF| BR R| SBR BC| 43195.0| 2015-03-21| 44247.49999999999| 42375.3275|33906.159999999996|21197.845800000003|
|Market C| Competitor E| FR AF| BR R| SBR BC| 41085.0| 2015-03-28| 43757.65| 43498.435| 37687.05725|26430.762899999998|
|Market C| Competitor E| FR AF| BR R| SBR BC| 35191.0| 2015-04-04| 42860.5| 43867.72| 40403.25275000001|31195.138949999997|
|Market C| Competitor E| FR AF| BR R| SBR BC| 37804.0| 2015-04-11|40275.200000000004| 43641.095| 42143.884| 35208.0581|
|Market C| Competitor E| FR AF| BR R| SBR BC| 34953.0| 2015-04-18| 38809.4| 42560.2875| 43034.33824999999| 38335.66804999999|
|Market C| Competitor E| FR AF| BR R| SBR BC| 32382.0| 2015-04-25| 37256.4| 41121.675| 43110.535625| 40555.88209999999|
|Market C| Competitor E| FR AF| BR R| SBR BC| 34295.0| 2015-05-02| 35494.4| 39561.0875| 42513.267875| 41892.22509999999|
|Market C| Competitor E| FR AF| BR R| SBR BC| 31506.0| 2015-05-09|34587.899999999994| 37945.5475| 41449.3035|42412.912599999996|
|Market C| Competitor E| FR AF| BR R| SBR BC| 29105.0| 2015-05-16| 33361.75| 36513.695| 40107.795| 42241.6692|
|Market C| Competitor E| FR AF| BR R| SBR BC| 29092.0| 2015-05-23|31918.350000000002| 35163.645| 38672.22687500001| 41539.7478|
+--------+---------------+------------+--------+-----------+---------------+-----------+------------------+------------------+------------------+------------------+
only showing top 20 rows
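The numbers in this output appear consistent with offsets -1 through -6 and weights 0.3, 0.3, 0.2, 0.1, 0.05, 0.05. These values are inferred from the sample rows and are an assumption, since the actual `offsets` and `delays` are not shown. The plain-Python mirror below (the name `weighted_average_list` is hypothetical) applies the same lag/lead and zero-fill logic to the first eight Sales values of one partition.

```python
def weighted_average_list(values, offsets, weights):
    """Plain-Python mirror of the lag/lead + coalesce logic for one partition."""
    assert len(offsets) == len(weights)
    n = len(values)
    out = []
    for row in range(n):
        total = 0.0
        for off, w in zip(offsets, weights):
            j = row + off  # negative offset = earlier row (lag), positive = later row (lead)
            if 0 <= j < n:  # rows outside the partition contribute 0, like coalesce(..., lit(0))
                total += values[j] * w
        out.append(total)
    return out


# First eight Sales values from the output above
sales = [28431.0, 28988.0, 34777.0, 36580.0, 42142.0, 44354.0, 46517.0, 45123.0]

# Assumptions: inferred from the sample output, not taken from the real code
offsets = [-1, -2, -3, -4, -5, -6]
weights = [0.3, 0.3, 0.2, 0.1, 0.05, 0.05]

week_1 = weighted_average_list(sales, offsets, weights)
week_2 = weighted_average_list(week_1, offsets, weights)

for s, a, b in zip(sales, week_1, week_2):
    print(s, round(a, 2), round(b, 2))
```

Under these assumptions, chaining the function only re-smooths the rows that already exist: `week_2` at a given row is a weighted average of `week_1` at earlier rows, not a value for a date after the last one. That would also explain why the first rows of each column are 0.0, because the missing lagged values are replaced by 0 rather than being excluded from the average. Unlike the PySpark version, this mirror does not handle null values inside `values` itself.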
Please help me.