Join two dataframes by ID

Date: 2017-01-17 22:35:42

Tags: scala apache-spark

This question is related to a previous one. I have two dataframes in Scala:

df1 =

ID  start_date_time      field1    field2
1   2016-10-12 11:55:23  AAA       xxx1
2   2016-10-12 12:25:00  BBB       xxx2
3   2016-10-12 16:20:00  CCC       xxx3

df2 =

PK  start_date
1   2016-10-12
2   2016-10-14

I need to add a new column to df1 whose value is 0 if the following condition fails, and 1 otherwise:

If ID == PK and start_date_time refers to the same year, month and day as start_date.

The result should be this:

df1 =

ID  start_date_time      check  field1   field2
1   2016-10-12 11:55:23  1      AAA      xxx1
2   2016-10-12 12:25:00  0      BBB      xxx2
3   2016-10-12 16:20:00  0      CCC      xxx3
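
For reference, the two input frames above can be reconstructed with a minimal sketch like the following (it assumes a SparkSession in scope named spark; the literal values are copied from the tables above):

import spark.implicits._  // assumption: a SparkSession named spark is in scope

// Rebuild the sample data shown above for experimentation
val df1 = Seq(
  (1, "2016-10-12 11:55:23", "AAA", "xxx1"),
  (2, "2016-10-12 12:25:00", "BBB", "xxx2"),
  (3, "2016-10-12 16:20:00", "CCC", "xxx3")
).toDF("ID", "start_date_time", "field1", "field2")

val df2 = Seq(
  (1, "2016-10-12"),
  (2, "2016-10-14")
).toDF("PK", "start_date")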

I use this solution:

import org.apache.spark.sql.functions.{lit, to_date}
import spark.implicits._  // for the $"col" syntax, assuming a SparkSession named spark is in scope

// Add a plain calendar date to each frame so the join can match on year, month and day
val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = (df2.withColumn("date", to_date(df2("start_date"))).
                        withColumn("check", lit(1)).
                        select($"PK".as("ID"), $"date", $"check", $"field1", $"field2"))

// Left join on (ID, date): matched rows get check = 1, unmatched rows get null, which na.fill(0) turns into 0
df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show

However, is it possible to avoid explicitly mentioning all of the column names in select($"PK".as("ID"), $"date", $"check", $"field1", $"field2")? Is it possible to do something like this instead: select($"PK".as("ID"), $"date", $"check", *)?
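
One approach that might avoid listing every column is to rename the join key with withColumnRenamed instead of re-selecting, so every other column of df2 is carried along unchanged. A minimal sketch (the name df2_keyed is hypothetical, and it reuses df1_date from above):

// Sketch: rename PK to ID and drop the raw start_date instead of enumerating columns
val df2_keyed = df2.withColumn("date", to_date(df2("start_date"))).
                    withColumn("check", lit(1)).
                    withColumnRenamed("PK", "ID").
                    drop("start_date")

df1_date.join(df2_keyed, Seq("ID", "date"), "left").drop("date").na.fill(0).show

With the df2 shown above (only PK and start_date) this produces the same check column as the explicit select; if df2 had more columns, they would simply pass through into the join output as well.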

0 Answers:

There are no answers yet.