I have the following two dataframes:
first_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- min_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
second_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
second_df contains data for a subset of companies. I need to fetch from first_df the rows for the company IDs listed in second_df.
Which Spark API would be useful here? How should I go about it?
Thanks.
Question extension:
If there are no stored records, first_df will be empty, so first_df("mean") and first_df("count") will be null, which makes acc_new_mean null. In that case I need new_mean to fall back to second_df("mean"). How can I do that? I tried the following .withColumn("new_mean", ...), but it did not work. Any clue how to handle this?
val acc_new_mean = (second_df("mean") + first_df("mean")) / (second_df("count") + first_df("count"))
val acc_new_count = second_df("count") + first_df("count")
val new_df = second_df.join(first_df.withColumnRenamed("company_id", "right_company_id").as("a"),
( $"a.right_company_id" === second_df("company_id") && ( second_df("min_dd") > $"a.max_dd" ) )
, "leftOuter")
.withColumn("new_mean", if(acc_new_mean == null) lit(second_df("mean")) else acc_new_mean )
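A Scala-level `if` cannot inspect a Column's per-row runtime value, so the null check has to be expressed as a Column expression. A minimal sketch of one way to do it (untested, mirroring the same join as in the attempt above, and assuming `spark.implicits._` is in scope for the `$` syntax) using `coalesce`:

```scala
// Sketch: coalesce returns the first non-null value per row, so when the
// left-outer join found no matching first_df row (its columns are null),
// it falls back to second_df("mean"). when/otherwise would work equally well.
import org.apache.spark.sql.functions.coalesce

val acc_new_mean = (second_df("mean") + first_df("mean")) /
                   (second_df("count") + first_df("count"))

val new_df = second_df
  .join(
    first_df.withColumnRenamed("company_id", "right_company_id").as("a"),
    $"a.right_company_id" === second_df("company_id") &&
      second_df("min_dd") > $"a.max_dd",
    "leftOuter")
  .withColumn("new_mean", coalesce(acc_new_mean, second_df("mean")))
```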
Answer 0 (score: 1)
Approach 1:
If joining the two dataframes via the DataFrame join API is cumbersome, you can use SQL instead (if you are familiar with it). To do that, register the two dataframes as temporary tables in Spark and write SQL against them. (Note: registerTempTable has been deprecated since Spark 2.0; createOrReplaceTempView is the current equivalent.)
second_df.registerTempTable("table_second_df")
first_df.registerTempTable("table_first_df")
val new_df = spark.sql("select distinct s.* from table_second_df s join table_first_df f on s.company_id=f.company_id")
new_df.show()
I have added logic according to your requirement.
Suppose your first_df looks like this:
+----------+----------+----------+----+-----+
|company_id| max_dd| min_dd|mean|count|
+----------+----------+----------+----+-----+
| A|2019-04-05|2019-04-01| 10| 100|
| A|2019-04-06|2019-04-02| 20| 200|
| B|2019-04-08|2019-04-01| 30| 300|
| B|2019-04-09|2019-04-02| 40| 400|
+----------+----------+----------+----+-----+
And suppose your second_df looks like this:
+----------+----------+----+-----+
|company_id| max_dd|mean|count|
+----------+----------+----+-----+
| A|2019-04-03| 10| 100|
| A|2019-04-02| 20| 200|
+----------+----------+----+-----+
Since company ID A is present in the second table, I take its latest record (ordered by max_dd) from second_df. Company ID B does not exist in second_df, so I take its latest record (ordered by max_dd) from first_df.
Please find the code below.
first_df.registerTempTable("table_first_df")
second_df.registerTempTable("table_second_df")
val new_df = spark.sql("""
  select company_id, max_dd, min_dd, mean, count from (
    select distinct s.company_id, s.max_dd, null as min_dd, s.mean, s.count,
           row_number() over (partition by s.company_id order by s.max_dd desc) rno
    from table_second_df s
    join table_first_df f on s.company_id = f.company_id
  ) where rno = 1
  union
  select company_id, max_dd, min_dd, mean, count from (
    select distinct f.*,
           row_number() over (partition by f.company_id order by f.max_dd desc) rno
    from table_first_df f
    left join table_second_df s on s.company_id = f.company_id
    where s.company_id is null
  ) where rno = 1
""")
new_df.show()
Below is the result:
Approach 2:
Instead of registering temporary tables as in Approach 1, you can join the dataframes directly with the DataFrame join API. The logic is the same as in Approach 1, just expressed through the DataFrame API. Don't forget to import org.apache.spark.sql.expressions.Window, since the code below uses Window.partitionBy.
val new_df = second_df.as('s)
  .join(first_df.as('f), $"s.company_id" === $"f.company_id", "inner")
  .drop($"min_dd")
  .withColumn("min_dd", lit(""))
  .select($"s.company_id", $"s.max_dd", $"min_dd", $"s.mean", $"s.count")
  .dropDuplicates
  .withColumn("Rno", row_number().over(
    Window.partitionBy($"s.company_id").orderBy($"s.max_dd".desc)))
  .filter($"Rno" === 1)
  .drop($"Rno")
  .union(
    first_df.as('f)
      .join(second_df.as('s), $"s.company_id" === $"f.company_id", "left_anti")
      .select($"f.company_id", $"f.max_dd", $"f.min_dd", $"f.mean", $"f.count")
      .dropDuplicates
      .withColumn("Rno", row_number().over(
        Window.partitionBy($"f.company_id").orderBy($"f.max_dd".desc)))
      .filter($"Rno" === 1)
      .drop($"Rno")
  )
new_df.show()
Below is the result:
Let me know if you have any questions.
Answer 1 (score: 0)