How do I fetch data from a second dataframe for all values matching a particular column value in the first dataframe?

Asked: 2019-04-05 11:30:54

Tags: scala apache-spark apache-spark-sql databricks

I have the following two dataframes:

first_df
 |-- company_id: string (nullable = true)
 |-- max_dd: date (nullable = true)
 |-- min_dd: date (nullable = true)
 |-- mean: double (nullable = true)
 |-- count: long (nullable = false)

second_df 
 |-- company_id: string (nullable = true)
 |-- max_dd: date (nullable = true)
 |-- mean: double (nullable = true)
 |-- count: long (nullable = false)

second_df contains data for some companies. I need to fetch the data from second_df for the company ids listed in first_df.

Which Spark API would work for me here? How should I go about it?

Thanks.
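For reference, a minimal sketch of one way to do this, assuming the match is on company_id (column names taken from the schemas above). A left semi join keeps only the second_df rows whose company_id also appears in first_df:

// Sketch only: keep the second_df rows whose company_id also appears in first_df
val matched = second_df.join(
  first_df.select("company_id").distinct(),
  Seq("company_id"),
  "left_semi")
matched.show()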

Follow-up to the question:

If there are no stored records, first_df will be empty. In that case first_df("mean") and first_df("count") will be null, which makes "acc_new_mean" null. When that happens I need to set "new_mean" to second_df("mean") instead. How can I do that? I tried it like this, but it didn't work. Any clue how to handle the .withColumn("new_mean", ...) here?

val acc_new_mean = (second_df("mean") + first_df("mean")) / (second_df("count") + first_df("count"))
val acc_new_count = second_df("count") + first_df("count")

val new_df = second_df.join(first_df.withColumnRenamed("company_id", "right_company_id").as("a"),
    ($"a.right_company_id" === second_df("company_id") && (second_df("min_dd") > $"a.max_dd")),
    "leftOuter")
  .withColumn("new_mean", if (acc_new_mean == null) lit(second_df("mean")) else acc_new_mean)
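For what it is worth, a sketch of one way to handle the fallback: a Scala if on a Column is evaluated once on the driver (the Column reference itself is never null), whereas when/coalesce from org.apache.spark.sql.functions are evaluated per row. The join below is simplified to company_id only, which is an assumption on my part:

import org.apache.spark.sql.functions.{coalesce, when}
import spark.implicits._  // assumes `spark` is the active SparkSession

// Sketch only: left-outer join, then fall back to second_df's values when no
// first_df record matched (i.e. the joined first_df columns are null)
val joined = second_df.as("s")
  .join(first_df.as("f"), $"s.company_id" === $"f.company_id", "left_outer")

val new_df = joined
  .withColumn("new_mean",
    when($"f.company_id".isNull, $"s.mean")
      .otherwise(($"s.mean" + $"f.mean") / ($"s.count" + $"f.count")))
  .withColumn("new_count", coalesce($"s.count" + $"f.count", $"s.count"))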

2 answers:

Answer 0 (score: 1)

Approach 1:

If joining the two dataframes with the dataframe join API is proving difficult, you can use SQL instead (if you are comfortable with it). To do that, register the two dataframes as temporary tables in Spark and write SQL against them.

second_df.registerTempTable("table_second_df")
first_df.registerTempTable("table_first_df")

val new_df = spark.sql("select distinct s.* from table_second_df s join table_first_df f on s.company_id=f.company_id")
new_df.show()
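Note that registerTempTable is deprecated on Spark 2.x and later; createOrReplaceTempView does the same thing:

// Equivalent on Spark 2.x+ (registerTempTable is deprecated there)
second_df.createOrReplaceTempView("table_second_df")
first_df.createOrReplaceTempView("table_first_df")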

I have added the logic as per your requirement.

Suppose your first_df looks like this:

+----------+----------+----------+----+-----+
|company_id|    max_dd|    min_dd|mean|count|
+----------+----------+----------+----+-----+
|         A|2019-04-05|2019-04-01|  10|  100|
|         A|2019-04-06|2019-04-02|  20|  200|
|         B|2019-04-08|2019-04-01|  30|  300|
|         B|2019-04-09|2019-04-02|  40|  400|
+----------+----------+----------+----+-----+

Suppose your second_df looks like this:

+----------+----------+----+-----+
|company_id|    max_dd|mean|count|
+----------+----------+----+-----+
|         A|2019-04-03|  10|  100|
|         A|2019-04-02|  20|  200|
+----------+----------+----+-----+
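If you want to reproduce this, here is a minimal sketch for building the sample dataframes; only the values come from the tables above, the construction itself is my assumption:

import spark.implicits._  // assumes `spark` is the active SparkSession

// Sample data matching the tables above; dates are cast from strings
val first_df = Seq(
  ("A", "2019-04-05", "2019-04-01", 10.0, 100L),
  ("A", "2019-04-06", "2019-04-02", 20.0, 200L),
  ("B", "2019-04-08", "2019-04-01", 30.0, 300L),
  ("B", "2019-04-09", "2019-04-02", 40.0, 400L)
).toDF("company_id", "max_dd", "min_dd", "mean", "count")
 .withColumn("max_dd", $"max_dd".cast("date"))
 .withColumn("min_dd", $"min_dd".cast("date"))

val second_df = Seq(
  ("A", "2019-04-03", 10.0, 100L),
  ("A", "2019-04-02", 20.0, 200L)
).toDF("company_id", "max_dd", "mean", "count")
 .withColumn("max_dd", $"max_dd".cast("date"))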

Since company id A is present in second_df, I take its latest record from second_df (by max_dd). Company id B is not in second_df, so I take its latest record from first_df (by max_dd).

Please find the code below.

first_df.registerTempTable("table_first_df")
second_df.registerTempTable("table_second_df")

val new_df = spark.sql("""
  select company_id, max_dd, min_dd, mean, count from (
    select distinct s.company_id, s.max_dd, null as min_dd, s.mean, s.count,
           row_number() over (partition by s.company_id order by s.max_dd desc) rno
    from table_second_df s join table_first_df f on s.company_id = f.company_id) where rno = 1
  union
  select company_id, max_dd, min_dd, mean, count from (
    select distinct f.*, row_number() over (partition by f.company_id order by f.max_dd desc) rno
    from table_first_df f left join table_second_df s on s.company_id = f.company_id
    where s.company_id is null) where rno = 1
""")
new_df.show()

Here is the result:

[screenshot of the resulting dataframe]

Approach 2:

Instead of registering the temporary tables described in Approach 1, you can use the dataframe join API. The logic is the same as in Approach 1, just expressed with the dataframe API. Don't forget to import org.apache.spark.sql.expressions.Window, since Window.partitionBy is used in the code below.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val new_df = second_df.as('s)
  .join(first_df.as('f), $"s.company_id" === $"f.company_id", "inner")
  .drop($"min_dd")
  .withColumn("min_dd", lit(""))
  .select($"s.company_id", $"s.max_dd", $"min_dd", $"s.mean", $"s.count")
  .dropDuplicates
  .withColumn("Rno", row_number().over(Window.partitionBy($"s.company_id").orderBy($"s.max_dd".desc)))
  .filter($"Rno" === 1)
  .drop($"Rno")
  .union(first_df.as('f)
    .join(second_df.as('s), $"s.company_id" === $"f.company_id", "left_anti")
    .select($"f.company_id", $"f.max_dd", $"f.min_dd", $"f.mean", $"f.count")
    .dropDuplicates
    .withColumn("Rno", row_number().over(Window.partitionBy($"f.company_id").orderBy($"f.max_dd".desc)))
    .filter($"Rno" === 1)
    .drop($"Rno"))
new_df.show()

Here is the result:

[screenshot of the resulting dataframe]

Let me know if you have any questions.

Answer 1 (score: 0)

main