将功能从python转换为pyspark

时间:2018-11-23 09:57:34

标签: python pyspark pyspark-sql

我想比较两个pyspark数据帧,并在新表中获取差异。

我测试了使用python的方式:

我的数据框

data = {'name': ['NO_VALUE', 'Molly', 'Tina', 'Jake', 'Amy'],
    'year': [2012, -999999, 2013, 2014, 2014],
    'reports': [4, 24, 31, 2, 3]}
df3 = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df3

我的参考数据框

data_ref = {'name': ['Jaso', 'Molly', 'Tina', 'Jake', 'Amy'],
    'year': [2012, 202, 2013, 2014, 2014],
    'reports': [4, 24, 31, 2, 3]}
df_ref3 = pd.DataFrame(data_ref, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df_ref3

比较行:

def compare_datasets(df, df_ref):
    ne_stacked = (df != df_ref).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']
    difference_locations = np.where(df != df_ref)
    changed_from = df.values[difference_locations]
    changed_to = df_ref.values[difference_locations]
    error_test = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    return error_test

compare_datasets(df3, df_ref3)

但是,我想在pyspark中做到这一点。有人有主意吗?

谢谢!

1 个答案:

答案 0 :(得分:0)

可能很难准确地重现此行为。 我为您提供了部分解决方案:

df.show()
+----------+--------+-------+-------+
|     index|    name|   year|reports|
+----------+--------+-------+-------+
|   Cochice|NO_VALUE|   2012|      4|
|      Pima|   Molly|-999999|     24|
|Santa Cruz|    Tina|   2013|     31|
|  Maricopa|    Jake|   2014|      2|
|      Yuma|     Amy|   2014|      3|
+----------+--------+-------+-------+

df_ref.show()
+----------+-----+----+-------+
|     index| name|year|reports|
+----------+-----+----+-------+
|   Cochice| Jaso|2012|      4|
|      Pima|Molly|2012|     24|
|Santa Cruz| Tina|2013|     31|
|  Maricopa| Jake|2014|      2|
|      Yuma|  Amy|2014|      3|
+----------+-----+----+-------+

df.subtract(df_ref).show()
+-------+--------+-------+-------+
|  index|    name|   year|reports|
+-------+--------+-------+-------+
|   Pima|   Molly|-999999|     24|
|Cochice|NO_VALUE|   2012|      4|
+-------+--------+-------+-------+

或者您可以做慢一点:

for col in df_ref.columns:
  out = df.select(col).subtract(df_ref.select(col))
  if out.first():
    print(out.collect())

[Row(name=u'NO_VALUE')]
[Row(year=-999999)]