高级连接两个数据框Spark Scala

时间:2018-12-16 16:18:41

标签: scala join apache-spark-sql apache-spark-dataset

我必须加入两个数据框。

样本: Dataframe1看起来像这样

df1_col1      df1_col2
   a            ex1
   b            ex4
   c            ex2
   d            ex6
   e            ex3

Dataframe2

df2_col1      df2_col2
   1           a,b,c
   2           d,c,e
   3           a,e,c

在结果数据框中,我想得到这样的结果

res_col1      res_col2       res_col3
    a           ex1             1
    a           ex1             3
    b           ex4             1
    c           ex2             1
    c           ex2             2
    c           ex2             3
    d           ex6             2
    e           ex3             2
    e           ex3             3

实现这种加入的最佳方法是什么?

3 个答案:

答案 0 :(得分:1)

我已经更新了以下代码

val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3")))
val df2 = sc.parallelize(Seq(List(("1","a,b,c"),("2","d,c,e")))).toDF
df2.withColumn("df2_col2_explode", explode(split($"_2", ","))).select($"_1".as("df2_col1"),$"df2_col2_explode").join(df1.select($"_1".as("df1_col1"),$"_2".as("df1_col2")), $"df1_col1"===$"df2_col2_explode","inner").show

您只需要分解这些值并通过分解将其生成然后再与其他数据框联接来生成多行。

您可以引用此链接How to split pipe-separated column into multiple rows?

答案 1 :(得分:1)

我将spark sql用于此连接,这是代码的一部分;

df1.createOrReplaceTempView("temp_v_df1")
df2.createOrReplaceTempView("temp_v_df2")
val df_result = spark.sql("""select 
                    |   b.df1_col1 as res_col1, 
                    |   b.df1_col2 as res_col2, 
                    |   a.df2_col1 as res_col3  
                    |   from (select df2_col1, exp_col 
                    |         from temp_v_df2 
                    |        lateral view explode(split(df2_col2,",")) dummy as exp_col) a
                    |   join temp_v_df1 b on a.exp_col = b.df1_col1""".stripMargin)

答案 2 :(得分:0)

我使用spark scala数据框来实现您的期望输出。

val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF("df1_col1","df1_col2") 

val df2 = sc.parallelize(Seq((1,("a,b,c")),(2,("d,c,e")),(3,("a,e,c")))).toDF("df2_col1","df2_col2") 

df2.withColumn("_tmp", explode(split($"df2_col2", "\\,"))).as("temp").join (df1,$"temp._tmp"===df1("df1_col1"),"inner").drop("_tmp","df2_col2").show
  

期望输出

+--------+--------+--------+
|df2_col1|df1_col1|df1_col2|
+--------+--------+--------+
|       2|       e|     ex3|
|       3|       e|     ex3|
|       2|       d|     ex6|
|       1|       c|     ex2|
|       2|       c|     ex2|
|       3|       c|     ex2|
|       1|       b|     ex4|
|       1|       a|     ex1|
|       3|       a|     ex1|
+--------+--------+--------+

根据您的要求重命名列。

以下是运行代码的屏幕截图

enter image description here 快乐的Hadooooooooooooooopooppppppppppppppppppppp