How do I perform a join on PySpark DataFrames?

Time: 2019-05-21 05:17:43

Tags: dataframe pyspark apache-spark-sql

I have two DataFrames, dd1 and dd2, and I want to join them.

dd1:

id name
 1  red
 2  green
 3  yellow
 4  black
 5  pink
 6  blue
 7  white
 8  grey

dd2:

  id  name1
   1  banana
   2  apple
   4  orange
   8  grapes
   9  leamon

I want the result to keep every row of dd1 and look like this:

id name     name1
 1  red     banana
 2  green   apple
 3  yellow  NULL
 4  black   orange
 5  pink    NULL
 6  blue    NULL
 7  white   NULL
 8  grey    grapes

1 Answer:

Answer 0 (score: 0)

You can do this with a left join on the id column. Try the following code:

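# Assumes `spark` is an existing SparkSession (e.g. the pyspark shell or a notebook).
df = spark.createDataFrame(
    [(1, 'red'), (2, 'green'), (3, 'yellow'), (4, 'black'), (5, 'pink'),
     (6, 'blue'), (7, 'white'), (8, 'grey')], ["id", "name"])
df.show()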

+---+------+
| id|  name|
+---+------+
|  1|   red|
|  2| green|
|  3|yellow|
|  4| black|
|  5|  pink|
|  6|  blue|
|  7| white|
|  8|  grey|
+---+------+

df1 = spark.createDataFrame(
    [(1, 'banana'), (2, 'apple'), (4, 'orange'), (8, 'grapes'), (9, 'leamon')],
    ["id1", "name1"])
df1.show()

+---+------+
|id1| name1|
+---+------+
|  1|banana|
|  2| apple|
|  4|orange|
|  8|grapes|
|  9|leamon|
+---+------+

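# A left join keeps every row of df; rows with no match in df1 get null in name1.
condition = [df.id == df1.id1]
left_join = df.join(df1, condition, how='left')

# Drop the redundant key column from df1 and sort by id.
left_join = left_join.drop("id1")
left_join = left_join.orderBy("id")

left_join.show()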

+---+------+------+
| id|  name| name1|
+---+------+------+
|  1|   red|banana|
|  2| green| apple|
|  3|yellow|  null|
|  4| black|orange|
|  5|  pink|  null|
|  6|  blue|  null|
|  7| white|  null|
|  8|  grey|grapes|
+---+------+------+
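
To see why ids 3, 5, 6 and 7 end up with null while id 9 disappears, here is a minimal plain-Python sketch of left-join semantics. It is not Spark code: `left_join_rows` is a hypothetical helper written only for this illustration, and it assumes the ids on the right side are unique, as they are in the example data.

```python
# Plain-Python illustration of left-join semantics (not Spark code).
# left_join_rows is a hypothetical helper written for this example.

def left_join_rows(left, right):
    """Left-join two lists of (id, value) tuples on id.

    Every left row is kept; a left id with no match on the right gets None.
    Assumes ids on the right are unique, as in the example data.
    """
    lookup = dict(right)  # maps id -> name1
    return [(key, value, lookup.get(key)) for key, value in left]


dd1 = [(1, 'red'), (2, 'green'), (3, 'yellow'), (4, 'black'),
       (5, 'pink'), (6, 'blue'), (7, 'white'), (8, 'grey')]
dd2 = [(1, 'banana'), (2, 'apple'), (4, 'orange'), (8, 'grapes'), (9, 'leamon')]

joined = left_join_rows(dd1, dd2)
for row in joined:
    print(row)
```

The row with id 9 exists only on the right side, so a left join drops it. If you also needed that row, a full outer join would be the option to look at instead.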