Spark: join DataFrame columns using an array

Asked: 2017-01-11 15:46:29

Tags: join apache-spark

I have two DataFrames, each with two columns:

  • df1 with schema (key1: Long, Value)

  • df2 with schema (key2: Array[Long], Value)

I need to join these DataFrames on the key columns (that is, find matches between key1 and the values inside key2). The problem is that the keys have different types. Is there a way to do this? One common workaround is sketched right below; the answers that follow show two more.
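For reference, a minimal sketch (my addition, not part of the original question) of that workaround: explode the array column into one row per element, then join on ordinary equality. The sample data and the name key2Elem are illustrative only.

import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._

// Tiny stand-ins for the DataFrames described above (hypothetical data).
val df1 = Seq((1L, "a"), (2L, "b")).toDF("key1", "Value")
val df2 = Seq((Array(1L, 3L), "x"), (Array(2L, 4L), "y")).toDF("key2", "Value")

// One row per array element, then a plain equi-join on the element.
val df2Exploded = df2.withColumn("key2Elem", explode(col("key2")))
val joined = df1.join(df2Exploded, col("key1") === col("key2Elem"))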

2 Answers:

Answer 0 (score: 2)

You can cast key1 and key2 to string and then use the contains function, as shown below.

import org.apache.spark.sql.functions.col
import spark.implicits._

// Sample data for df1: a Long key and a value column.
val df1 = sc.parallelize(Seq((1L, "one.df1"),
                             (2L, "two.df1"),
                             (3L, "three.df1"))).toDF("key1", "Value")

DF1:
+----+---------+
|key1|Value    |
+----+---------+
|1   |one.df1  |
|2   |two.df1  |
|3   |three.df1|
+----+---------+

// Sample data for df2: an Array[Long] key and a value column.
val df2 = sc.parallelize(Seq((Array(1L, 1L), "one.df2"),
                             (Array(2L, 2L), "two.df2"),
                             (Array(3L, 3L), "three.df2"))).toDF("key2", "Value")
DF2:
+------+---------+
|key2  |Value    |
+------+---------+
|[1, 1]|one.df2  |
|[2, 2]|two.df2  |
|[3, 3]|three.df2|
+------+---------+

// Render both keys as strings and join on substring containment.
val joinedRDD = df1.join(df2, col("key2").cast("string").contains(col("key1").cast("string")))
joinedRDD.show(false)

JOIN:
+----+---------+------+---------+
|key1|Value    |key2  |Value    |
+----+---------+------+---------+
|1   |one.df1  |[1, 1]|one.df2  |
|2   |two.df1  |[2, 2]|two.df2  |
|3   |three.df1|[3, 3]|three.df2|
+----+---------+------+---------+
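One caveat worth illustrating (my addition, not part of the original answer): comparing string renderings matches substrings, so this join can produce false positives. A minimal sketch with hypothetical data:

import org.apache.spark.sql.functions.col
import spark.implicits._

// 1 is not an element of [11, 2], yet the array's string rendering
// contains the character "1", so these rows still join.
val dfA = Seq((1L, "a")).toDF("key1", "Value")
val dfB = Seq((Array(11L, 2L), "b")).toDF("key2", "Value")
dfA.join(dfB, col("key2").cast("string").contains(col("key1").cast("string"))).show(false)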

Answer 1 (score: 2)

The best way to do this (and it does not require casting or exploding either DataFrame) is to use the array_contains Spark SQL expression, as shown below.

import org.apache.spark.sql.functions.expr
import spark.implicits._

val df1 = Seq((1L, "one.df1"), (2L, "two.df1"), (3L, "three.df1")).toDF("key1", "Value")
val df2 = Seq((Array(1L, 1L), "one.df2"),
              (Array(2L, 2L), "two.df2"),
              (Array(3L, 3L), "three.df2")).toDF("key2", "Value")

// Join where key2 contains key1, evaluated per row via a SQL expression.
val joinedRDD = df1.join(df2, expr("array_contains(key2, key1)"))
joinedRDD.show

+----+---------+------+---------+
|key1|    Value|  key2|    Value|
+----+---------+------+---------+
|   1|  one.df1|[1, 1]|  one.df2|
|   2|  two.df1|[2, 2]|  two.df2|
|   3|three.df1|[3, 3]|three.df2|
+----+---------+------+---------+

Note that you cannot use the array_contains function from the DataFrame API directly here, because it requires the second argument to be a literal rather than a column expression.
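A small follow-up (my addition, assuming the df1, df2, and imports from the snippet above): both inputs carry a column named Value, so the joined result has two identically named columns. Renaming them before the join keeps later selects unambiguous.

// Rename the duplicate "Value" columns so the result can be
// referenced unambiguously after the join.
val joinedRenamed = df1.withColumnRenamed("Value", "Value1")
  .join(df2.withColumnRenamed("Value", "Value2"), expr("array_contains(key2, key1)"))
joinedRenamed.select("key1", "Value1", "Value2").show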