Spark: how to run a loop function over DataFrames

Asked: 2018-04-06 08:41:28

Tags: sql scala function apache-spark dataframe

I have two DataFrames, shown below, and I am trying to look up the second one by a foreign key and then generate a new DataFrame. I was thinking of doing spark.sql("""select history.val as previous_year_1 from df1, history where df1.key = history.key and history.date = add_months($currentdate, -1*12)"""), but I would need to do this 10 times, once per previous_year column, and then join the results back together. How can I write a function for this? Many thanks. Quite new here.

dataframe one:
   +---+---+-----------+
   |key|val| date      |
   +---+---+-----------+
   |  1|100| 2018-04-16|
   |  2|200| 2018-04-16| 
   +---+---+-----------+
dataframe two: historical data
   +---+---+-----------+
   |key|val| date      |
   +---+---+-----------+
   |  1|10 | 2017-04-16|
   |  1|20 | 2016-04-16| 
   +---+---+-----------+

The result I want to generate is:

   +---+----------+-----------------+-----------------+
   |key|date      | previous_year_1 | previous_year_2 |
   +---+----------+-----------------+-----------------+
   |  1|2018-04-16| 10              | 20              |
   |  2|null      | null            | null            |
   +---+----------+-----------------+-----------------+
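For reference, a minimal sketch for building the two example DataFrames (the names sourceDf1 and sourceDf2 are chosen to match the first answer below, and an active SparkSession named spark is assumed):

import spark.implicits._
import org.apache.spark.sql.functions.to_date

// Example data matching the tables above; the date strings are cast to
// DateType so functions like add_months() and year() work on them.
val sourceDf1 = Seq(
  (1, 100, "2018-04-16"),
  (2, 200, "2018-04-16")
).toDF("key", "val", "date").withColumn("date", to_date($"date"))

val sourceDf2 = Seq(
  (1, 10, "2017-04-16"),
  (1, 20, "2016-04-16")
).toDF("key", "val", "date").withColumn("date", to_date($"date"))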

2 Answers:

Answer 0 (score: 1):

To solve this, the following approach can be applied:

1) Join the two DataFrames on key.

2) Filter out every row whose historical date is not a whole number of years before the reference date.

3) Compute the year difference for each row and store it in a dedicated column.

4) Pivot the DataFrame on the column computed in the previous step and aggregate the value for each corresponding year.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes an active SparkSession named spark

private def generateWhereForPreviousYears(nbYears: Int): Column =
  (-1 to -nbYears by -1) // loop over each backwards year value
    .map { yearsBack =>
      /*
       * Each backwards year count is turned into an expression for the
       * WHERE clause. This is the equivalent of
       * "history.date = add_months($currentdate, -1*12)"
       * from the question.
       */
      add_months($"df1.date", 12 * yearsBack) === $"df2.date"
    }
    /*
     * The .map call above produces a sequence of Column expressions;
     * we concatenate them with "or" to obtain a single Spark Column.
     * The .reduce() function is the most appropriate tool here.
     */
    .reduce(_ or _) or $"df2.date".isNull // the final "or" keeps keys with no history in the result

val nbYearsBack = 3

val result = sourceDf1.as("df1")
  .join(sourceDf2.as("df2"), $"df1.key" === $"df2.key", "left")
  .where(generateWhereForPreviousYears(nbYearsBack))
  .withColumn("diff_years", concat(lit("previous_year_"), year($"df1.date") - year($"df2.date")))
  .groupBy($"df1.key", $"df1.date")
  .pivot("diff_years")
  .agg(first($"df2.val")) // "val" matches the column name in the question's schema
  .drop("null") // drop the unwanted extra column holding the null pivot bucket

The output is:

+---+----------+---------------+---------------+
|key|date      |previous_year_1|previous_year_2|
+---+----------+---------------+---------------+
|1  |2018-04-16|10             |20             |
|2  |2018-04-16|null           |null           |
+---+----------+---------------+---------------+
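As a side note (my suggestion, not part of the original answer): if the pivot values are enumerated explicitly, Spark skips the extra pass it otherwise needs to discover them, and the unwanted null column is never created, so the .drop("null") step becomes unnecessary. A sketch reusing the names from the code above:

// Hypothetical variant: list the expected pivot values up front so the
// "null" bucket never appears and Spark avoids a value-discovery scan.
val expectedYears = (1 to nbYearsBack).map(n => s"previous_year_$n")

val result = sourceDf1.as("df1")
  .join(sourceDf2.as("df2"), $"df1.key" === $"df2.key", "left")
  .where(generateWhereForPreviousYears(nbYearsBack))
  .withColumn("diff_years", concat(lit("previous_year_"), year($"df1.date") - year($"df2.date")))
  .groupBy($"df1.key", $"df1.date")
  .pivot("diff_years", expectedYears)
  .agg(first($"df2.val"))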

Answer 1 (score: 1):

Let me "read between the lines" and offer you a "similar" solution:

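(The code block that originally stood here was lost; what follows is purely my assumption of the simpler variant the answer implies: pivot only the historical DataFrame and left-join it onto df1.)

import org.apache.spark.sql.functions.max

// Reconstruction (assumption): pivot the history by date, keep df1 as-is,
// and left-join so each historical date becomes a column next to df1's data.
val df2Pivot = df2.groupBy("key").pivot("date").agg(max("val"))

val result = df1.join(df2Pivot, Seq("key"), "left")
result.show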

If you really need to change the column names, feel free to manipulate the data accordingly; a minimal rename sketch follows.
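A sketch of such a rename, assuming the fixed 2018-04-16 reference date from the example (the date-to-previous_year_N mapping is mine, not part of the answer):

// Hypothetical rename: map each pivoted date column to its
// previous_year_N name relative to the 2018-04-16 reference date.
val renamed = result
  .withColumnRenamed("2017-04-16", "previous_year_1")
  .withColumnRenamed("2016-04-16", "previous_year_2")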

Or even better:

import org.apache.spark.sql.functions.max

// Pivot both DataFrames on date so each date becomes its own column,
// then left-join the pivoted history onto the pivoted current data.
val df1Pivot = df1.groupBy("key").pivot("date").agg(max("val"))
val df2Pivot = df2.groupBy("key").pivot("date").agg(max("val"))

val result = df1Pivot.join(df2Pivot, Seq("key"), "left")
result.show

+---+----------+----------+----------+                                          
|key|2018-04-16|2016-04-16|2017-04-16|
+---+----------+----------+----------+
|  1|       100|        20|        10|
|  2|       200|      null|      null|
+---+----------+----------+----------+