PySpark: subtractByKey problem with two DataFrames

Time: 2018-03-05 23:00:18

Tags: python-3.x apache-spark pyspark comparison spark-dataframe

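After comparing two DataFrames, I want the output DataFrame to contain only the columns identified as having different values. I am having trouble finding a way to proceed.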

**Code:**

    # sql_context is an existing SQLContext
    columns = ["name", "id", "fruit", "animal", "veggie"]
    df_a = sql_context.createDataFrame(
        [("a", 3, "apple", "bear", "carrot"),
         ("b", 5, "orange", "lion", "cabbage"),
         ("c", 7, "pears", "tiger", "onion"),
         ("c", 8, "jackfruit", "elephant", "raddish"),
         ("c", 9, "watermelon", "giraffe", "tomato")],
        columns)
    df_b = sql_context.createDataFrame(
        [("a", 3, "apple", "bear", "carrot"),
         ("b", 5, "orange", "lion", "cabbage"),
         ("c", 7, "banana", "tiger", "onion"),
         ("c", 8, "jackfruit", "camel", "raddish")],
        columns)
    df_a = df_a.alias('df_a')
    df_b = df_b.alias('df_b')
    # A left anti join keeps only the df_a rows whose (id, name) has no match in df_b
    df = df_a.join(df_b, (df_a.id == df_b.id) & (df_a.name == df_b.name), 'leftanti').select('df_a.*')
    # show() returns None, so call it separately instead of assigning its result to df
    df.show()

I am trying to match Dataframe 1 and Dataframe 2 on the key (id, name).

Dataframe 1:
+----+---+----------+--------+-------+
|name| id|     fruit|  animal| veggie|
+----+---+----------+--------+-------+
|   a|  3|     apple|    bear| carrot|
|   b|  5|    orange|    lion|cabbage|
|   c|  7|     pears|   tiger|  onion|
|   c|  8| jackfruit|elephant|raddish|
|   c|  9|watermelon| giraffe| tomato|
+----+---+----------+--------+-------+

Dataframe 2:
+----+---+---------+------+-------+
|name| id|    fruit|animal| veggie|
+----+---+---------+------+-------+
|   a|  3|    apple|  bear| carrot|
|   b|  5|   orange|  lion|cabbage|
|   c|  7|   banana| tiger|  onion|
|   c|  8|jackfruit| camel|raddish|
+----+---+---------+------+-------+
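
To see what a key-only subtraction does with these two tables, here is a minimal plain-Python model of the semantics of `subtractByKey` and of a left anti join on (name, id). It is not Spark code. The names `rows_a`, `rows_b`, `key` and `only_in_a` are made up for this illustration, and the rows are copied from the tables above.

```python
# Plain-Python model of key-based subtraction (illustrative, not Spark API).
rows_a = [
    ("a", 3, "apple", "bear", "carrot"),
    ("b", 5, "orange", "lion", "cabbage"),
    ("c", 7, "pears", "tiger", "onion"),
    ("c", 8, "jackfruit", "elephant", "raddish"),
    ("c", 9, "watermelon", "giraffe", "tomato"),
]
rows_b = [
    ("a", 3, "apple", "bear", "carrot"),
    ("b", 5, "orange", "lion", "cabbage"),
    ("c", 7, "banana", "tiger", "onion"),
    ("c", 8, "jackfruit", "camel", "raddish"),
]


def key(row):
    """Return the (name, id) pair used as the matching key."""
    return (row[0], row[1])


# Keys present in the second table.
keys_b = {key(row) for row in rows_b}

# Keep only the rows of the first table whose key never appears in the second.
only_in_a = [row for row in rows_a if key(row) not in keys_b]

print(only_in_a)
```

Because only the key is compared, the rows (c, 7) and (c, 8) are dropped even though their fruit and animal values differ. Only (c, 9) survives, so a key-only subtraction alone cannot produce the result wanted here.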



Expected dataframe:
+----+---+----------+--------+
|name| id|     fruit|  animal|
+----+---+----------+--------+
|   c|  7|     pears|   tiger|
|   c|  8| jackfruit|elephant|
|   c|  9|watermelon| giraffe|
+----+---+----------+--------+

0 answers:

No answers