I have the following two dataframes:
first_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- min_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
second_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
second_df contains data for a subset of companies. I need to fetch from first_df the rows for the company IDs listed in second_df.
Which Spark API would be useful here? How should I go about it?
Thanks.
Question extension:
If there are no stored records, first_df will be empty, so first_df("mean") and first_df("count") will be null, which makes acc_new_mean null. In that case I need new_mean to fall back to second_df("mean"). How can I do that? I tried the following .withColumn("new_mean", ...), but it did not work. Any clue how to handle this?
val acc_new_mean = (second_df("mean") + first_df("mean")) / (second_df("count") + first_df("count"))
val acc_new_count = second_df("count") + first_df("count")
val new_df = second_df.join(first_df.withColumnRenamed("company_id", "right_company_id").as("a"),
( $"a.right_company_id" === second_df("company_id") && ( second_df("min_dd") > $"a.max_dd" ) )
, "leftOuter")
.withColumn("new_mean", if(acc_new_mean == null) lit(second_df("mean")) else acc_new_mean )
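A Scala-level `if` cannot inspect a Column's per-row runtime value, so the null check has to be expressed as a Column expression. A minimal sketch of one way to do it (untested, mirroring the same join as in the attempt above, and assuming `spark.implicits._` is in scope for the `$` syntax) using `coalesce`:

```scala
// Sketch: coalesce returns the first non-null value per row, so when the
// left-outer join found no matching first_df row (its columns are null),
// it falls back to second_df("mean"). when/otherwise would work equally well.
import org.apache.spark.sql.functions.coalesce

val acc_new_mean = (second_df("mean") + first_df("mean")) /
                   (second_df("count") + first_df("count"))

val new_df = second_df
  .join(
    first_df.withColumnRenamed("company_id", "right_company_id").as("a"),
    $"a.right_company_id" === second_df("company_id") &&
      second_df("min_dd") > $"a.max_dd",
    "leftOuter")
  .withColumn("new_mean", coalesce(acc_new_mean, second_df("mean")))
```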
Answer 0 (score: 1)
Approach 1:
If joining the two dataframes via the DataFrame join API is cumbersome, you can use SQL instead (if you are familiar with it). To do that, register the two dataframes as temporary tables in Spark and write SQL against them. (Note: registerTempTable has been deprecated since Spark 2.0; createOrReplaceTempView is the current equivalent.)
second_df.registerTempTable("table_second_df")
first_df.registerTempTable("table_first_df")
val new_df = spark.sql("select distinct s.* from table_second_df s join table_first_df f on s.company_id=f.company_id")
new_df.show()
I have added logic according to your requirement.
Suppose your first_df looks like this:
+----------+----------+----------+----+-----+
|company_id| max_dd| min_dd|mean|count|
+----------+----------+----------+----+-----+
| A|2019-04-05|2019-04-01| 10| 100|
| A|2019-04-06|2019-04-02| 20| 200|
| B|2019-04-08|2019-04-01| 30| 300|
| B|2019-04-09|2019-04-02| 40| 400|
+----------+----------+----------+----+-----+
And suppose your second_df looks like this:
+----------+----------+----+-----+
|company_id| max_dd|mean|count|
+----------+----------+----+-----+
| A|2019-04-03| 10| 100|
| A|2019-04-02| 20| 200|
+----------+----------+----+-----+
Since company ID A is present in the second table, I take its latest record (ordered by max_dd) from second_df. Company ID B does not exist in second_df, so I take its latest record (ordered by max_dd) from first_df.
Please find the code below.
first_df.registerTempTable("table_first_df")
second_df.registerTempTable("table_second_df")
val new_df = spark.sql("""
  select company_id, max_dd, min_dd, mean, count from (
    select distinct s.company_id, s.max_dd, null as min_dd, s.mean, s.count,
           row_number() over (partition by s.company_id order by s.max_dd desc) rno
    from table_second_df s
    join table_first_df f on s.company_id = f.company_id
  ) where rno = 1
  union
  select company_id, max_dd, min_dd, mean, count from (
    select distinct f.*,
           row_number() over (partition by f.company_id order by f.max_dd desc) rno
    from table_first_df f
    left join table_second_df s on s.company_id = f.company_id
    where s.company_id is null
  ) where rno = 1
""")
new_df.show()
Below is the result:
Approach 2:
Instead of registering temporary tables as in Approach 1, you can join the dataframes directly with the DataFrame join API. The logic is the same as in Approach 1, just expressed through the DataFrame API. Don't forget to import org.apache.spark.sql.expressions.Window, since the code below uses Window.partitionBy.
val new_df = second_df.as('s)
  .join(first_df.as('f), $"s.company_id" === $"f.company_id", "inner")
  .drop($"min_dd")
  .withColumn("min_dd", lit(""))
  .select($"s.company_id", $"s.max_dd", $"min_dd", $"s.mean", $"s.count")
  .dropDuplicates
  .withColumn("Rno", row_number().over(
    Window.partitionBy($"s.company_id").orderBy($"s.max_dd".desc)))
  .filter($"Rno" === 1)
  .drop($"Rno")
  .union(
    first_df.as('f)
      .join(second_df.as('s), $"s.company_id" === $"f.company_id", "left_anti")
      .select($"f.company_id", $"f.max_dd", $"f.min_dd", $"f.mean", $"f.count")
      .dropDuplicates
      .withColumn("Rno", row_number().over(
        Window.partitionBy($"f.company_id").orderBy($"f.max_dd".desc)))
      .filter($"Rno" === 1)
      .drop($"Rno")
  )
new_df.show()
Below is the result:
Let me know if you have any questions.
Answer 1 (score: 0)