I want to compare two PySpark dataframes and get the differences in a new table.
I tested an approach in plain Python with pandas:
My dataframe:
import pandas as pd
import numpy as np

data = {'name': ['NO_VALUE', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, -999999, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df3 = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df3
My reference dataframe:
data_ref = {'name': ['Jaso', 'Molly', 'Tina', 'Jake', 'Amy'],
            'year': [2012, 202, 2013, 2014, 2014],
            'reports': [4, 24, 31, 2, 3]}
df_ref3 = pd.DataFrame(data_ref, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df_ref3
Comparing the rows:
def compare_datasets(df, df_ref):
    # boolean frame of cell-wise differences, stacked into (row, column) pairs
    ne_stacked = (df != df_ref).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']
    # positions of the differing cells, then the old and new values
    difference_locations = np.where(df != df_ref)
    changed_from = df.values[difference_locations]
    changed_to = df_ref.values[difference_locations]
    error_test = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    return error_test

compare_datasets(df3, df_ref3)
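With the frames above this should return something like the following (one row per changed cell, keyed by row index and column):

              from    to
id      col
Cochice name  NO_VALUE  Jaso
Pima    year  -999999   202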
However, I would like to do this in PySpark. Does anyone have an idea?
Thanks!
Answer 0 (score: 0)
It may be hard to reproduce this behaviour exactly; here is a partial solution:
df.show()
+----------+--------+-------+-------+
|     index|    name|   year|reports|
+----------+--------+-------+-------+
|   Cochice|NO_VALUE|   2012|      4|
|      Pima|   Molly|-999999|     24|
|Santa Cruz|    Tina|   2013|     31|
|  Maricopa|    Jake|   2014|      2|
|      Yuma|     Amy|   2014|      3|
+----------+--------+-------+-------+
df_ref.show()
+----------+-----+----+-------+
|     index| name|year|reports|
+----------+-----+----+-------+
|   Cochice| Jaso|2012|      4|
|      Pima|Molly| 202|     24|
|Santa Cruz| Tina|2013|     31|
|  Maricopa| Jake|2014|      2|
|      Yuma|  Amy|2014|      3|
+----------+-----+----+-------+
df.subtract(df_ref).show()
+-------+--------+-------+-------+
|  index|    name|   year|reports|
+-------+--------+-------+-------+
|   Pima|   Molly|-999999|     24|
|Cochice|NO_VALUE|   2012|      4|
+-------+--------+-------+-------+
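Note that subtract only returns the (distinct) rows of df that do not appear in df_ref. If you also want rows that exist only in the reference frame, run it in the other direction as well:

df_ref.subtract(df).show()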
Or, more slowly, column by column:
for col in df_ref.columns:
    out = df.select(col).subtract(df_ref.select(col))
    if out.first():
        print(out.collect())
[Row(name=u'NO_VALUE')]
[Row(year=-999999)]
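If the goal is the same from/to table as in pandas (one row per changed cell, keyed by the row index and the column name), a join-based sketch could look like the one below. It assumes both frames carry an index column that uniquely identifies rows and share the same set of columns; the function name and the id/col/from/to aliases are purely illustrative, not an established API.

from functools import reduce
import pyspark.sql.functions as F

def compare_datasets_spark(df, df_ref, key='index'):
    value_cols = [c for c in df.columns if c != key]
    # join the two frames on the key so corresponding rows sit side by side
    a = df.select([F.col(key)] + [F.col(c).alias(c + '_from') for c in value_cols])
    b = df_ref.select([F.col(key)] + [F.col(c).alias(c + '_to') for c in value_cols])
    joined = a.join(b, on=key)
    # one small frame per column, keeping only the cells that actually changed
    # (rows where either side is null are dropped here; use eqNullSafe if that matters)
    diffs = [
        joined.filter(F.col(c + '_from') != F.col(c + '_to'))
              .select(F.col(key).alias('id'),
                      F.lit(c).alias('col'),
                      F.col(c + '_from').cast('string').alias('from'),
                      F.col(c + '_to').cast('string').alias('to'))
        for c in value_cols
    ]
    # stack the per-column frames into one long table: one row per differing cell
    return reduce(lambda x, y: x.union(y), diffs)

compare_datasets_spark(df, df_ref).show()

For the frames above this should yield two rows: (Cochice, name, NO_VALUE, Jaso) and (Pima, year, -999999, 202). Everything is cast to string so the per-column frames can be unioned even though the original columns have mixed types.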