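I'm trying to find the mismatches between two DataFrames in PySpark, and I'm a bit weak in SQL. For example:
df1 contains the columns ID, name, city, gender, married ...
df2 contains the columns ID, name, city, gender, married ...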
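ID and name are the primary columns.
I need to pass these columns in from a list.
I want to get the rows where either primary column is null or both are null, as well as the rows where the data in any of the columns listed above does not match.
I need these mismatches in another DataFrame.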
Here is what I tried:
I made a list of the columns and passed it to this query, which I just found online:
query_dataframe = spark.sql("select diff.column_names, t1.("+primary_coloumn+"), t2.("+expected_column+") from table1 t1 full outer join table2 t2 on (t2.("+primary_coloumn+") = t1.("+primary_coloumn+"))cross apply (select stuff((select ', ' + t.name as [text()] from (select '("+primary_coloumn+")' as name where t1.("+primary_coloumn+") is null or t2.("+primary_coloumn+") is null union all select '("+initial_column+")' where not t1.("+initial_column+") = t2.("+initial_column+") union all select '("+expected_column+")' where not ((t1.("+expected_column+") is null and t2.("+expected_column+") is null) or (t1.("+expected_column+") = t2.("+expected_column+")))) t for xml path(''), type).value('.','varchar(8000)'),1, 2, '') as column_names) diff where diff.column_names is not null")
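Note on the query above: as far as I can tell, CROSS APPLY, STUFF and FOR XML PATH are SQL Server (T-SQL) constructs, and Spark SQL does not parse them. Below is a minimal sketch of how the same idea might be written in Spark SQL instead, using concat_ws (which skips NULLs) and the null-safe equality operator <=>. The function name build_mismatch_query, the t1_/t2_ output prefixes, and the column lists are assumptions made up for illustration; only table1, table2 and column_names come from the query above. The helper only builds the query string, so it runs without a Spark session.

```python
def build_mismatch_query(primary_cols, compare_cols, left="table1", right="table2"):
    """Return a Spark SQL string listing, per joined row, the columns that differ.

    Hypothetical helper: assumes column names contain no single quotes.
    """
    def q(col):
        # Backtick-quote identifiers so unusual column names still parse.
        return "`" + col.replace("`", "``") + "`"

    # Join on every primary column; rows with a NULL key never match,
    # so the full outer join leaves them unmatched on both sides.
    join_cond = " AND ".join(f"t1.{q(c)} = t2.{q(c)}" for c in primary_cols)

    # A primary column is flagged when it is NULL on either side.
    checks = [
        f"CASE WHEN t1.{q(c)} IS NULL OR t2.{q(c)} IS NULL THEN '{c}' END"
        for c in primary_cols
    ]
    # Any other column is flagged when the two values differ (NULL-safe).
    checks += [
        f"CASE WHEN NOT (t1.{q(c)} <=> t2.{q(c)}) THEN '{c}' END"
        for c in compare_cols
    ]

    # Show both sides of every column next to the list of differing names.
    select_cols = ", ".join(
        f"t1.{q(c)} AS {q('t1_' + c)}, t2.{q(c)} AS {q('t2_' + c)}"
        for c in list(primary_cols) + list(compare_cols)
    )

    return (
        f"SELECT * FROM ("
        f"SELECT {select_cols}, concat_ws(', ', {', '.join(checks)}) AS column_names "
        f"FROM {left} t1 FULL OUTER JOIN {right} t2 ON {join_cond}"
        f") diff WHERE column_names <> ''"
    )


# Example column lists (assumed from the description of df1 and df2).
primary_columns = ["ID", "name"]
other_columns = ["city", "gender", "married"]
query = build_mismatch_query(primary_columns, other_columns)
print(query)

# With an existing SparkSession named `spark` (not created here):
# df1.createOrReplaceTempView("table1")
# df2.createOrReplaceTempView("table2")
# query_dataframe = spark.sql(query)
```

Because an unmatched row has NULLs on one whole side, every compared column that is non-NULL on the other side is reported for that row as well. If only the primary columns should be reported in that case, the second group of checks would need an extra condition.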
If possible, please also suggest other approaches; any suggestion is welcome. Please help! Thanks in advance.