Joining two DataFrames of different sizes in Scala

Time: 2018-08-18 18:44:34

Tags: scala

Hi, I have 2 dataframes:

scala> d1.show()               scala> d2.show()
+--------+-------+             +--------+----------+
|   fecha|eventos|             |   fecha|TotalEvent|
+--------+-------+             +--------+----------+
|20180404|      3|             |       0|     23534|
|20180405|      7|             |20180322|        10|
|20180406|     10|             |20180326|        50|
|20180409|      4|             |20180402|         6|
....                           |20180403|       118|
scala> d1.count()              |20180404|      1110|
res3: Long = 60                 ...
                               scala> d2.count()
                               res7: Long = 74

But I'd like to join them by fecha without losing any data, and then create a new column using the math operation (TotalEvent - eventos) * 100 / TotalEvent.

Something like this:

+---------+-------+----------+--------+
|fecha    |eventos|TotalEvent|  KPI   |
+---------+-------+----------+--------+
|        0|       |    23534 |  100.00|
| 20180322|       |       10 |  100.00|
| 20180326|       |       50 |  100.00|
| 20180402|       |        6 |  100.00|
| 20180403|       |      118 |  100.00|
| 20180404|     3 |     1110 |   99.73|
| 20180405|     7 |     1204 |   99.42|
| 20180406|    10 |     1526 |   99.34|
| 20180407|       |       14 |  100.00|
| 20180409|     4 |     1230 |   99.67|
| 20180410|    11 |     1456 |   99.24|
| 20180411|     6 |     1572 |   99.62|
| 20180412|     5 |     1450 |   99.66|
| 20180413|     7 |     1214 |   99.42|
 .....
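
For example, the 20180404 row comes out as (1110 - 3) * 100 / 1110 = 99.7297..., which rounds to 99.73.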

The problem is that I can't find a way to do it. When I use:

scala> d1.join(d2,d2("fecha").contains(d1("fecha")), "left").show()

I lose the rows that don't exist in both tables.

+--------+-------+--------+----------+
|   fecha|eventos|   fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404|      3|20180404|      1110|
|20180405|      7|20180405|      1204|
|20180406|     10|20180406|      1526|
|20180409|      4|20180409|      1230|
|20180410|     11|20180410|      1456|
 ....

Also, how can I add the new column with the math operation?

Thanks

3 Answers:

Answer 0 (score: 1)

I would suggest left-joining df1 to df2 and computing KPI based on whether eventos is null in the joined dataset (using when/otherwise):

import org.apache.spark.sql.functions._

val df1 = Seq(
  ("20180404", 3),
  ("20180405", 7),
  ("20180406", 10),
  ("20180409", 4)
).toDF("fecha", "eventos")

val df2 = Seq(
  ("0", 23534),
  ("20180322", 10),
  ("20180326", 50),
  ("20180402", 6),
  ("20180403", 118),
  ("20180404", 1110),
  ("20180405", 100),
  ("20180406", 100)
).toDF("fecha", "TotalEvent")

df2.
  join(df1, Seq("fecha"), "left_outer").
  withColumn("KPI",
    round(
      when($"eventos".isNull, 100.0).
        otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
      2
    )
  ).show
// +--------+----------+-------+-----+
// |   fecha|TotalEvent|eventos|  KPI|
// +--------+----------+-------+-----+
// |       0|     23534|   null|100.0|
// |20180322|        10|   null|100.0|
// |20180326|        50|   null|100.0|
// |20180402|         6|   null|100.0|
// |20180403|       118|   null|100.0|
// |20180404|      1110|      3|99.73|
// |20180405|       100|      7| 93.0|
// |20180406|       100|     10| 90.0|
// +--------+----------+-------+-----+

Note that if you want the more precise, raw KPI, simply remove the wrapping round.
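
One thing to watch: the left join above keeps only df2's fechas, so a date present only in df1 (such as 20180409 in the question) is dropped. A minimal sketch of the same calculation with a full outer join instead, under the assumption that a row with a missing TotalEvent should also default to 100:

// Hedged variant, not part of the answer above: the full outer join also
// keeps fechas that appear only in df1, so TotalEvent can now be null too;
// that case is pinned to 100.0 here as an assumption.
df2.
  join(df1, Seq("fecha"), "outer").
  withColumn("KPI",
    round(
      when($"eventos".isNull || $"TotalEvent".isNull, 100.0).
        otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
      2
    )
  ).show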

Answer 1 (score: 0)

I would do it in several steps. First join, then select the calculated column, then fill the NAs:

@ val df2a = df2.withColumnRenamed("fecha", "fecha2")  // to avoid ambiguous column names after the join

@ val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")

@ val kpi = df3.withColumn("KPI", ($"TotalEvent" - $"eventos") / $"TotalEvent" * 100).na.fill(100, Seq("KPI"))

@ kpi.show()
+--------+-------+--------+----------+-----------------+
|   fecha|eventos|  fecha2|TotalEvent|              KPI|
+--------+-------+--------+----------+-----------------+
|    null|   null|20180402|         6|            100.0|
|    null|   null|       0|     23534|            100.0|
|    null|   null|20180322|        10|            100.0|
|20180404|      3|20180404|      1110|99.72972972972973|
|20180406|     10|    null|      null|            100.0|
|    null|   null|20180403|       118|            100.0|
|    null|   null|20180326|        50|            100.0|
|20180409|      4|    null|      null|            100.0|
|20180405|      7|    null|      null|            100.0|
+--------+-------+--------+----------+-----------------+
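
Since the outer join keeps both key columns (fecha and fecha2, each null on unmatched rows), a possible extra step, sketched here on the assumption that a single date column is wanted as in the expected output, is to merge them with coalesce and drop the duplicate:

@ import org.apache.spark.sql.functions.coalesce

@ val tidy = kpi.withColumn("fecha", coalesce($"fecha", $"fecha2")).drop("fecha2")  // keep whichever key is non-null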

Answer 2 (score: 0)

I solved it by mixing the two suggestions:

val dfKPI = d1.
  join(d2, Seq("cliente", "fecha"), "outer").
  orderBy("fecha").
  withColumn("KPI",
    round(
      when($"eventos".isNull, 100.0).
        otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
      2
    )
  )
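
One hedged caveat, not from the thread itself: after the outer join, rows that exist only in d1 carry a null TotalEvent, so the otherwise branch gives them a null KPI. If those rows should also default to 100, the na.fill step from answer 1 can be reused (dfKPIFilled is a hypothetical name):

val dfKPIFilled = dfKPI.na.fill(100, Seq("KPI"))  // null KPI (from a missing TotalEvent) -> 100.0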