For the given data, how do I calculate the 5-day, 10-day and 15-day means?

Asked: 2019-02-27 17:00:00

Tags: scala apache-spark apache-spark-sql spark-streaming

Scenario:

I have the following dataframe:

```
companyId | calc_date  | mean
----------+------------+------
1111      | 01-08-2002 | 15
1111      | 02-08-2002 | 16.5
1111      | 03-08-2002 | 17
1111      | 04-08-2002 | 15
1111      | 05-08-2002 | 23
1111      | 06-08-2002 | 22.6
1111      | 07-08-2002 | 25
1111      | 08-08-2002 | 15
1111      | 09-08-2002 | 15
1111      | 10-08-2002 | 16.5
1111      | 11-08-2002 | 22.6
1111      | 12-08-2002 | 15
1111      | 13-08-2002 | 16.5
1111      | 14-08-2002 | 25
1111      | 15-08-2002 | 16.5
```
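
For reference, here is a minimal sketch that builds this sample as a Spark dataframe (`df` and the numeric type of `mean` are my own assumptions about the schema):

```
import spark.implicits._

// Sample data copied from the table above; mean is assumed to be Double.
val df = Seq(
  (1111, "01-08-2002", 15.0),
  (1111, "02-08-2002", 16.5),
  (1111, "03-08-2002", 17.0),
  (1111, "04-08-2002", 15.0),
  (1111, "05-08-2002", 23.0),
  (1111, "06-08-2002", 22.6),
  (1111, "07-08-2002", 25.0),
  (1111, "08-08-2002", 15.0),
  (1111, "09-08-2002", 15.0),
  (1111, "10-08-2002", 16.5),
  (1111, "11-08-2002", 22.6),
  (1111, "12-08-2002", 15.0),
  (1111, "13-08-2002", 16.5),
  (1111, "14-08-2002", 25.0),
  (1111, "15-08-2002", 16.5)
).toDF("companyId", "calc_date", "mean")
```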

Required:

For the given data, a 5-day mean, a 10-day mean and a 15-day mean need to be calculated for every record of each company.

5-day mean   --> average of the available `mean` values over the previous 5 days
10-day mean  --> average of the available `mean` values over the previous 10 days
15-day mean  --> average of the available `mean` values over the previous 15 days
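
For example, under this reading the 5-day mean for 05-08-2002 would average the `mean` values from 01-08-2002 through 05-08-2002 (an interpretation; the window could equally be taken to exclude the current day): (15 + 16.5 + 17 + 15 + 23) / 5 = 86.5 / 5 = 17.3.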

The resulting dataframe should have the calculated columns, as follows:

```
companyId | calc_date  | mean | 5-day mean | 10-day mean | 15-day mean
```

Question:
How can this be achieved? What is the best way to do this in Spark?

1 Answer:

Answer 0 (score: 2):

Here is one approach that uses a Window partitioned by company, with `rangeBetween` over a specified timestamp range, to compute each n-day mean between the current row and the preceding rows, as shown below (using a dummy dataset):

```
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = (1 to 3).flatMap(i => Seq.tabulate(15)(j => (i, s"${j+1}-2-2019", j+1))).
  toDF("company_id", "calc_date", "mean")

df.show
// +----------+---------+----+
// |company_id|calc_date|mean|
// +----------+---------+----+
// |         1| 1-2-2019|   1|
// |         1| 2-2-2019|   2|
// |         1| 3-2-2019|   3|
// |         1| 4-2-2019|   4|
// |         1| 5-2-2019|   5|
// |          ...             |
// |         1|14-2-2019|  14|
// |         1|15-2-2019|  15|
// |         2| 1-2-2019|   1|
// |         2| 2-2-2019|   2|
// |         2| 3-2-2019|   3|
// |          ...             |
// +----------+---------+----+

def winSpec = Window.partitionBy("company_id").orderBy("ts")
def dayRange(days: Int) = winSpec.rangeBetween(-(days * 24 * 60 * 60), 0)

df.
  withColumn("ts", unix_timestamp(to_date($"calc_date", "d-M-yyyy"))).
  withColumn("mean-5", mean($"mean").over(dayRange(5))).
  withColumn("mean-10", mean($"mean").over(dayRange(10))).
  withColumn("mean-15", mean($"mean").over(dayRange(15))).
  show
// +----------+---------+----+----------+------+-------+-------+
// |company_id|calc_date|mean|        ts|mean-5|mean-10|mean-15|
// +----------+---------+----+----------+------+-------+-------+
// |         1| 1-2-2019|   1|1549008000|   1.0|    1.0|    1.0|
// |         1| 2-2-2019|   2|1549094400|   1.5|    1.5|    1.5|
// |         1| 3-2-2019|   3|1549180800|   2.0|    2.0|    2.0|
// |         1| 4-2-2019|   4|1549267200|   2.5|    2.5|    2.5|
// |         1| 5-2-2019|   5|1549353600|   3.0|    3.0|    3.0|
// |         1| 6-2-2019|   6|1549440000|   3.5|    3.5|    3.5|
// |         1| 7-2-2019|   7|1549526400|   4.5|    4.0|    4.0|
// |         1| 8-2-2019|   8|1549612800|   5.5|    4.5|    4.5|
// |         1| 9-2-2019|   9|1549699200|   6.5|    5.0|    5.0|
// |         1|10-2-2019|  10|1549785600|   7.5|    5.5|    5.5|
// |         1|11-2-2019|  11|1549872000|   8.5|    6.0|    6.0|
// |         1|12-2-2019|  12|1549958400|   9.5|    7.0|    6.5|
// |         1|13-2-2019|  13|1550044800|  10.5|    8.0|    7.0|
// |         1|14-2-2019|  14|1550131200|  11.5|    9.0|    7.5|
// |         1|15-2-2019|  15|1550217600|  12.5|   10.0|    8.0|
// |         3| 1-2-2019|   1|1549008000|   1.0|    1.0|    1.0|
// |         3| 2-2-2019|   2|1549094400|   1.5|    1.5|    1.5|
// |         3| 3-2-2019|   3|1549180800|   2.0|    2.0|    2.0|
// |         3| 4-2-2019|   4|1549267200|   2.5|    2.5|    2.5|
// |         3| 5-2-2019|   5|1549353600|   3.0|    3.0|    3.0|
// +----------+---------+----+----------+------+-------+-------+
// only showing top 20 rows

```

Note that if the dates are guaranteed to form a contiguous daily time series, you can use `rowsBetween` (rather than `rangeBetween`) directly on `calc_date`.
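
Following up on that note, here is a minimal sketch of the `rowsBetween` variant applied to the original data, assuming exactly one row per company per day with no gaps (`dayRows` is a hypothetical helper name, and the `dd-MM-yyyy` pattern matches the 01-08-2002-style dates above). `rowsBetween(-days, 0)` covers the same current-row-plus-previous-`days`-rows window that `dayRange` does:

```
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

// Order rows by the parsed date; no epoch-seconds column is needed, because
// with a contiguous daily series row offsets stand in for day offsets.
def winSpec = Window.partitionBy("companyId").orderBy(to_date($"calc_date", "dd-MM-yyyy"))
def dayRows(days: Int) = winSpec.rowsBetween(-days, 0)

df.
  withColumn("5-day mean",  mean($"mean").over(dayRows(5))).
  withColumn("10-day mean", mean($"mean").over(dayRows(10))).
  withColumn("15-day mean", mean($"mean").over(dayRows(15))).
  show
```

The trade-off: `rangeBetween` over a timestamp keeps the window at exactly n calendar days even when some dates are missing, whereas `rowsBetween` is simpler but silently widens the window across any gaps in the series.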