Question

I am trying to find the mismatches between two DataFrames in PySpark, and I am a bit weak in SQL. For example:

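df1 contains the columns ID, name, city, gender, married, ...

df2 contains the columns ID, name, city, gender, married, ...

ID and name are the primary (key) columns.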

I need to pass these columns in from a list.

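I want to get the rows where either one or both of the primary columns are null, as well as the rows where the data in any of the columns listed above does not match.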

I need to collect this mismatched data in another DataFrame.

Here is what I have tried:

I built a list of the columns and passed it into this query, which I simply found online:

query_dataframe = spark.sql("select diff.column_names, t1.("+primary_coloumn+"), t2.("+expected_column+") from table1 t1 full outer join table2 t2 on (t2.("+primary_coloumn+") = t1.("+primary_coloumn+"))cross apply (select stuff((select ', ' + t.name as [text()] from (select '("+primary_coloumn+")' as name where t1.("+primary_coloumn+") is null or t2.("+primary_coloumn+") is null union all select '("+initial_column+")' where not t1.("+initial_column+") = t2.("+initial_column+") union all select '("+expected_column+")' where not ((t1.("+expected_column+") is null and t2.("+expected_column+") is null) or (t1.("+expected_column+") = t2.("+expected_column+")))) t for xml path(''), type).value('.','varchar(8000)'),1, 2, '') as column_names) diff where diff.column_names is not null")

If possible, please also suggest other options; otherwise any advice is welcome. Please help! Thanks in advance.
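One possible alternative, given only as a sketch: the query above relies on T-SQL constructs (`cross apply`, `stuff`, `for xml path`) that Spark SQL does not provide, so the same idea could be expressed with `concat_ws` and the null-safe comparison `<=>` instead. The helper name `build_mismatch_query` and the `t1_`/`t2_` output prefixes are hypothetical, and the sketch assumes that df1 and df2 are registered as the temporary views `table1` and `table2` and that all column names are plain identifiers.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    # Assumption: only plain identifiers are allowed, so that names can be
    # embedded in the SQL text without further escaping.
    if not _IDENT.fullmatch(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def build_mismatch_query(key_cols, compare_cols, left="table1", right="table2"):
    """Build a Spark SQL string that lists, per joined row, the mismatched columns."""
    if not key_cols:
        raise ValueError("at least one key column is required")
    keys = [_check(c) for c in key_cols]
    others = [_check(c) for c in compare_cols]
    _check(left)
    _check(right)

    # Rows are paired on the key columns; NULL keys never satisfy "=",
    # so such rows come out of the full outer join unmatched.
    join_cond = " AND ".join(f"t1.`{c}` = t2.`{c}`" for c in keys)

    # Show both sides of every column next to each other.
    selected = [
        f"t1.`{c}` AS `t1_{c}`, t2.`{c}` AS `t2_{c}`" for c in keys + others
    ]

    # A key column is flagged when it is NULL on either side.
    flags = [
        f"CASE WHEN t1.`{c}` IS NULL OR t2.`{c}` IS NULL THEN '{c}' END"
        for c in keys
    ]
    # Other columns are flagged when they differ; <=> treats NULL as equal to NULL.
    flags += [
        f"CASE WHEN NOT (t1.`{c}` <=> t2.`{c}`) THEN '{c}' END" for c in others
    ]

    # concat_ws skips NULL arguments, so rows without any flag get ''.
    return (
        f"SELECT * FROM (SELECT {', '.join(selected)}, "
        f"concat_ws(', ', {', '.join(flags)}) AS column_names "
        f"FROM {left} t1 FULL OUTER JOIN {right} t2 ON {join_cond}) diff "
        f"WHERE column_names <> ''"
    )


# Column names taken from the example in the question.
print(build_mismatch_query(["ID", "name"], ["city", "gender", "married"]))
```

After `df1.createOrReplaceTempView("table1")` and `df2.createOrReplaceTempView("table2")`, passing the generated string to `spark.sql(...)` should return the mismatched rows as a new DataFrame, with the names of the differing columns in `column_names`. Note that a row present on only one side has its key columns flagged, along with every other column that is non-null on that side.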

Compare two DataFrames in PySpark on columns with the same name?

0 answers: