How to filter one dataframe against another dataframe in pyspark (e.g. "select * from X where X.colx in (select coly from Y)")?

Date: 2019-12-21 00:10:15

Tags: pyspark pyspark-sql

Trying to filter a pyspark dataframe based on a column of another dataframe. For example, I have some tsv files like...

test11.tsv.gz

name    id
a       1234
b       5678
c       7890


test12.tsv.gz

name    id
a       1234
f       1010
c       7890

and am trying to filter the test11 dataframe based on the test12 dataframe, to get something like...

name    id
a       1234
c       7890

Currently attempting this with code like...

>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import *

>>> sparkSession = SparkSession.builder.appName("data_debugging").getOrCreate()

# reading the tsv files (note: the separator must be the actual tab character, not "true")
>>> df1 = sparkSession.read.option("header", "true").option("sep", "\t").csv("hdfs://hw001.co.local/tmp/test11.tsv.gz")
>>> df2 = sparkSession.read.option("header", "true").option("sep", "\t").csv("hdfs://hw001.co.local/tmp/test12.tsv.gz")


# first trying like this
>>> df1[df1["name"].isin(df2["name"])]
...
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Resolved attribute(s) name#33 missing from name#10,id#11 in operator !Filter name#10 IN (name#33). Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.;;\n!Filter name#10 IN (name#33)\n+- Relation[name#10,id#11] csv\n'
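Side note on the error: Column.isin expects plain Python values, not a Column belonging to another DataFrame, which is why Spark reports the attribute as unresolvable here. A minimal workaround sketch, assuming df2 is small enough to collect its filter values to the driver:

# collect the distinct filter values into a Python list first,
# then isin works since it receives plain values rather than a Column
>>> names = [row["name"] for row in df2.select("name").distinct().collect()]
>>> df1.filter(df1["name"].isin(names)).show()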


# then even with changing the column name for one of the dataframes like...
>>> df1[df1["name"].isin(df2.withColumnRenamed("name", "__FILTER")["__FILTER"])]
....
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Resolved attribute(s) __FILTER#52 missing from name#10,id#11 in operator !Filter name#10 IN (__FILTER#52).;;\n!Filter name#10 IN (__FILTER#52)\n+- Relation[name#10,id#11] csv\n'

So at this point I'm not sure what to do here, and I don't have enough pyspark experience to interpret the resulting error messages. Does anyone know what to do?
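One way to express the exact SQL form from the title is to register both dataframes as temporary views and let Spark SQL evaluate the IN subquery itself; a sketch of that approach (the view names t1/t2 are arbitrary):

>>> df1.createOrReplaceTempView("t1")
>>> df2.createOrReplaceTempView("t2")
# equivalent of: select * from X where X.colx in (select coly from Y)
>>> sparkSession.sql("select * from t1 where name in (select name from t2)").show()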


One thing to note is that in my actual use case the tables may have many other, differing columns (which may share the same names across tables while meaning different things), so I'm not sure a join is the right approach here, since I only want the columns of the table being filtered (and name collisions can occur when joining). E.g. simply joining like...

>>> df1 = sparkSession.createDataFrame([(123, "bob"), (456, "larry"), (789, "jeff")], ["id", "name"])
>>> df1.show()
+---+-----+
| id| name|
+---+-----+
|123|  bob|
|456|larry|
|789| jeff|
+---+-----+
>>> df2 = sparkSession.createDataFrame([(123, "mgmt", "europe"), (6789, "sales", "asia"), (789, "logistics", "USA")], ["id", "name", "docs"])
>>> df2.show()
+----+---------+------+
|  id|     name|  docs|
+----+---------+------+
| 123|     mgmt|europe|
|6789|    sales|  asia|
| 789|logistics|   USA|
+----+---------+------+

# gets us...
>>> df1.join(df2, "id", "inner").show()
+---+----+---------+------+
| id|name|     name|  docs|
+---+----+---------+------+
|789|jeff|logistics|   USA|
|123| bob|     mgmt|europe|
+---+----+---------+------+

where the "name" column names collide.
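For what it's worth, a left_semi join appears to sidestep that collision entirely, since a semi join only ever returns the left side's columns (it just filters df1 by the existence of a match in df2); a sketch against the same example frames:

# keeps only df1's columns (id, name), so the duplicate "name" never appears
>>> df1.join(df2, "id", "left_semi").show()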

0 Answers:

There are no answers yet.