Compare two dataframes in pyspark on columns with the same names?

Time: 2018-07-06 10:58:18

Tags: python mysql pandas apache-spark-sql pyspark-sql

I am trying to get the mismatches between two dataframes in pyspark, and I am a bit weak in SQL. For example:

  

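df1 contains the columns ID, name, city, gender, married, ...

df2 contains the columns ID, name, city, gender, married, ...

with ID and name as the primary columns.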

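I need to pass these columns in from a list.

I want to get the rows where either of the primary columns is null, or both are null, as well as the rows where the data in any of the columns listed above does not match.

I need to get this mismatched data in another dataframe.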

This is what I have tried:

I made a list of the columns and then passed it to this query, which I simply found online:

query_dataframe = spark.sql("select diff.column_names, t1.("+primary_coloumn+"), t2.("+expected_column+") from table1 t1 full outer join table2 t2 on (t2.("+primary_coloumn+") = t1.("+primary_coloumn+"))cross apply (select stuff((select ', ' + t.name as [text()] from (select '("+primary_coloumn+")' as name where t1.("+primary_coloumn+") is null or t2.("+primary_coloumn+") is null union all select '("+initial_column+")' where not t1.("+initial_column+") = t2.("+initial_column+") union all select '("+expected_column+")' where not ((t1.("+expected_column+") is null and t2.("+expected_column+") is null) or (t1.("+expected_column+") = t2.("+expected_column+")))) t for xml path(''), type).value('.','varchar(8000)'),1, 2, '') as column_names) diff where diff.column_names is not null")
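Note that `cross apply`, `stuff(...)` and `for xml path` are SQL Server constructs, and Spark SQL does not accept them, nor the `t1.(column)` form. As a rough sketch of the same idea in Spark SQL: the helper below builds the query text from two Python lists, using a full outer join, the null-safe equality operator `<=>`, and `concat_ws` (which skips NULL arguments) to collect the names of the offending columns. The function names `build_mismatch_query` and `find_mismatches`, the `t1_`/`t2_` output prefixes, and the restriction to plain alphanumeric column names are assumptions made for illustration; the view names `table1` and `table2` are taken from the query above.

```python
import re

# Assumption: column and view names are plain identifiers. They are spliced
# into the SQL text, so anything else is rejected rather than escaped.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def build_mismatch_query(primary_columns, compare_columns,
                         left="table1", right="table2"):
    """Return Spark SQL text that lists rows differing between two views."""
    keys = [_check(c) for c in primary_columns]
    others = [_check(c) for c in compare_columns]
    _check(left)
    _check(right)
    if not keys:
        raise ValueError("at least one primary column is required")

    # A key is flagged when it is NULL on either side: either the key itself
    # is null, or the row found no partner in the full outer join.
    flags = [
        f"CASE WHEN t1.`{c}` IS NULL OR t2.`{c}` IS NULL THEN '{c}' END"
        for c in keys
    ]
    # Other columns are flagged when they differ; <=> treats two NULLs as equal.
    flags += [
        f"CASE WHEN NOT (t1.`{c}` <=> t2.`{c}`) THEN '{c}' END"
        for c in others
    ]

    # Show both sides of every column so the difference is visible.
    selected = [
        f"t{i}.`{c}` AS `t{i}_{c}`" for c in keys + others for i in (1, 2)
    ]
    join_on = " AND ".join(f"t1.`{c}` = t2.`{c}`" for c in keys)

    # concat_ws skips NULLs, so a fully matching row yields an empty string.
    return (
        "SELECT * FROM ("
        f"SELECT concat_ws(', ', {', '.join(flags)}) AS column_names, "
        f"{', '.join(selected)} "
        f"FROM `{left}` t1 FULL OUTER JOIN `{right}` t2 ON {join_on}"
        ") diff WHERE diff.column_names <> ''"
    )


def find_mismatches(spark, df1, df2, primary_columns, compare_columns):
    """Register both dataframes as temp views and run the generated query."""
    df1.createOrReplaceTempView("table1")
    df2.createOrReplaceTempView("table2")
    return spark.sql(build_mismatch_query(primary_columns, compare_columns))
```

With the columns from the example, a call might look like `find_mismatches(spark, df1, df2, ["ID", "name"], ["city", "gender", "married"])`. One consequence of this design: a row that has no partner on the other side is reported with every primary column flagged, plus every compared column that is non-null on the side where the row exists.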

If possible, please also suggest other options; otherwise, any suggestion is welcome. Please help! Thanks in advance.

0 answers:

No answers