Difference between two rows in a Spark dataframe

Time: 2017-08-05 23:22:59

Tags: scala apache-spark apache-spark-sql

I created a dataframe in Spark by grouping on column1 and date and summing the amount.

val table = df1.groupBy($"column1",$"date").sum("amount")
Column1 |Date   |Amount
A   |1-jul  |1000
A   |1-june |2000
A   |1-May  |2000
A   |1-dec  |3000
A   |1-Nov  |2000
B   |1-jul  |100
B   |1-june |300    
B   |1-May  |400
B   |1-dec  |300

Now I want to add a new column that holds the difference between the amounts for any two dates in the table.

3 answers:

Answer 0 (score: 13)

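If the calculation is fixed, for example the difference from the previous month or from two months earlier, you can use Window functions. The lag and lead functions can be used together with a Window.

For that, however, you need to convert the date column as shown below so that the rows can be ordered.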

+-------+------+--------------+------+
|Column1|Date  |Date_Converted|Amount|
+-------+------+--------------+------+
|A      |1-jul |2017-07-01    |1000  |
|A      |1-june|2017-06-01    |2000  |
|A      |1-May |2017-05-01    |2000  |
|A      |1-dec |2017-12-01    |3000  |
|A      |1-Nov |2017-11-01    |2000  |
|B      |1-jul |2017-07-01    |100   |
|B      |1-june|2017-06-01    |300   |
|B      |1-May |2017-05-01    |400   |
|B      |1-dec |2017-12-01    |300   |
+-------+------+--------------+------+
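The answer does not show how Date_Converted was produced. One possible way to normalize labels such as "1-jul", "1-june" and "1-May" is sketched below in plain Scala. The names `DateNormalizer` and `toLocalDate` are hypothetical, and the year is an assumption passed in by the caller (the table above uses 2017), since the original labels carry no year.

```scala
import java.time.LocalDate
import scala.util.Try

object DateNormalizer {
  // Month numbers keyed by the first three letters of the month name, lower-cased,
  // so that "jul", "june" and "May" all resolve.
  private val monthNumbers: Map[String, Int] =
    Seq("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")
      .zipWithIndex
      .map { case (name, index) => name -> (index + 1) }
      .toMap

  // Parses "<day>-<month name>" labels; returns None for anything it cannot parse.
  // The year is NOT in the label, so it is supplied by the caller (an assumption).
  def toLocalDate(raw: String, year: Int): Option[LocalDate] =
    raw.trim.split("-") match {
      case Array(day, month) =>
        for {
          dayOfMonth  <- Try(day.trim.toInt).toOption
          monthNumber <- monthNumbers.get(month.trim.toLowerCase.take(3))
          date        <- Try(LocalDate.of(year, monthNumber, dayOfMonth)).toOption
        } yield date
      case _ => None
    }
}
```

In Spark, a helper like this could be wrapped in a UDF (for example one returning `java.sql.Date`) to add the Date_Converted column. Because the result is a real date, ordering by it is chronological rather than alphabetical.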

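You can find the difference between the current month and the previous month (strictly, the previous row within each Column1 group, so Nov is compared with jul) with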
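import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Outside spark-shell, $"..." also needs: import spark.implicits._

// Order the rows of each Column1 group chronologically
val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted")
// Amount from the previous row of the group; null for the first row
val prevAmount = lag("Amount", 1).over(windowSpec)
df.withColumn("diff_Amt_With_Prev_Month", $"Amount" - when(prevAmount.isNull, 0).otherwise(prevAmount))
  .show(false)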

You should get

+-------+------+--------------+------+------------------------+
|Column1|Date  |Date_Converted|Amount|diff_Amt_With_Prev_Month|
+-------+------+--------------+------+------------------------+
|B      |1-May |2017-05-01    |400   |400.0                   |
|B      |1-june|2017-06-01    |300   |-100.0                  |
|B      |1-jul |2017-07-01    |100   |-200.0                  |
|B      |1-dec |2017-12-01    |300   |200.0                   |
|A      |1-May |2017-05-01    |2000  |2000.0                  |
|A      |1-june|2017-06-01    |2000  |0.0                     |
|A      |1-jul |2017-07-01    |1000  |-1000.0                 |
|A      |1-Nov |2017-11-01    |2000  |1000.0                  |
|A      |1-dec |2017-12-01    |3000  |1000.0                  |
+-------+------+--------------+------+------------------------+

To compare with the amount from two months (two rows) earlier, increase the lag offset:

df.withColumn("diff_Amt_With_Prev_two_Month", $"Amount" - when((lag("Amount", 2).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 2).over(windowSpec)))
  .show(false)

which will give you

+-------+------+--------------+------+----------------------------+
|Column1|Date  |Date_Converted|Amount|diff_Amt_With_Prev_two_Month|
+-------+------+--------------+------+----------------------------+
|B      |1-May |2017-05-01    |400   |400.0                       |
|B      |1-june|2017-06-01    |300   |300.0                       |
|B      |1-jul |2017-07-01    |100   |-300.0                      |
|B      |1-dec |2017-12-01    |300   |0.0                         |
|A      |1-May |2017-05-01    |2000  |2000.0                      |
|A      |1-june|2017-06-01    |2000  |2000.0                      |
|A      |1-jul |2017-07-01    |1000  |-1000.0                     |
|A      |1-Nov |2017-11-01    |2000  |0.0                         |
|A      |1-dec |2017-12-01    |3000  |2000.0                      |
+-------+------+--------------+------+----------------------------+
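To make the semantics of the two outputs above concrete, here is a small plain-Scala model (no Spark) of what `lag` over that window computes. It is only an illustrative sketch: `LagModel`, `AmountRow` and `diffWithLag` are hypothetical names, and the sample rows are copied from the tables above.

```scala
object LagModel {
  final case class AmountRow(column1: String, dateConverted: String, amount: Double)

  // For each column1 group ordered by dateConverted, pairs every row with
  // (amount - amount of the row `offset` positions earlier), using 0 when
  // there is no such earlier row. This mirrors the when/otherwise fallback.
  def diffWithLag(rows: Seq[AmountRow], offset: Int): Seq[(AmountRow, Double)] =
    rows.groupBy(_.column1).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      val ordered = group.sortBy(_.dateConverted) // ISO dates sort correctly as strings
      ordered.zipWithIndex.map { case (row, index) =>
        val previous = if (index >= offset) ordered(index - offset).amount else 0.0
        row -> (row.amount - previous)
      }
    }

  // Sample data taken from the Date_Converted table above.
  val sample: Seq[AmountRow] = Seq(
    AmountRow("A", "2017-07-01", 1000), AmountRow("A", "2017-06-01", 2000),
    AmountRow("A", "2017-05-01", 2000), AmountRow("A", "2017-12-01", 3000),
    AmountRow("A", "2017-11-01", 2000), AmountRow("B", "2017-07-01", 100),
    AmountRow("B", "2017-06-01", 300),  AmountRow("B", "2017-05-01", 400),
    AmountRow("B", "2017-12-01", 300)
  )
}
```

Calling `LagModel.diffWithLag(LagModel.sample, 1)` and `LagModel.diffWithLag(LagModel.sample, 2)` yields the same differences as the two tables, with group A listed before group B. Note that the first row (or first two rows) of each group gets its own amount as the "difference" because the missing earlier value is treated as 0.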

I hope the answer is helpful.

Answer 1 (score: 1)

Assuming those two dates belong to each group of your table.

My imports:

Preparing the dataframe

Now write a UDF for your case,

Now, preparing the output

import org.apache.spark.sql.functions.{concat_ws,collect_list,lit}

Hope this is what you want.

Answer 2 (score: 0)

(table.filter($"Date".isin("1-jul", "1-dec"))
      .groupBy("Column1")
      .pivot("Date")
      .agg(first($"Amount"))
      .withColumn("diff", $"1-dec" - $"1-jul")
).show
+-------+-----+-----+----+
|Column1|1-dec|1-jul|diff|
+-------+-----+-----+----+
|      B|  300|  100| 200|
|      A| 3000| 1000|2000|
+-------+-----+-----+----+