Comparing two text files using Spark DataFrames

Date: 2018-10-08 05:59:07

Tags: pyspark apache-spark-sql

I want to implement the following requirement using Spark DataFrames: compare two text/CSV files. Ideally, File1.txt should be compared with File2.txt, and the result should be written to another txt file with each record flagged as (SAME / UPDATE / INSERT / DELETE).

UPDATE - if any record value in file2 has changed compared to file1
INSERT - if a new record exists only in file2
DELETE - if the record exists only in file1 (not in file2)
SAME - if the same record exists in both files
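
Before translating this to Spark, the four cases can be sketched in plain Python. This is purely illustrative (the `classify` helper and its dict-based approach are my own, not part of the question); it assumes the first field (NO) is the key that identifies a record across the two files:

```python
def classify(file1_rows, file2_rows):
    """Classify rows keyed by the first field (NO) into S/U/I/D buckets.

    file1_rows / file2_rows: lists of tuples whose first element is the key.
    Returns a list of (row, flag) pairs. Illustrative sketch only -- the
    Spark answers below express the same logic with joins.
    """
    old = {r[0]: r for r in file1_rows}
    new = {r[0]: r for r in file2_rows}
    result = []
    for key, row in new.items():
        if key not in old:
            result.append((row, 'I'))   # INSERT: only in file2
        elif row == old[key]:
            result.append((row, 'S'))   # SAME: identical in both files
        else:
            result.append((row, 'U'))   # UPDATE: changed in file2
    for key, row in old.items():
        if key not in new:
            result.append((row, 'D'))   # DELETE: only in file1
    return result

rows1 = [(1, 'IT', 'RAM', 1000), (3, 'HR', 'GOPI', 1500)]
rows2 = [(1, 'IT', 'RAM', 1000), (4, 'MT', 'SUMP', 1200)]
print(classify(rows1, rows2))
# -> [((1, 'IT', 'RAM', 1000), 'S'), ((4, 'MT', 'SUMP', 1200), 'I'), ((3, 'HR', 'GOPI', 1500), 'D')]
```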

[Image: expected output]

So far I have attempted the code shown below, but I am not able to proceed any further. Please help.

[Screenshot: code attempted so far]

After performing a leftJoin, rightJoin, and innerJoin, is there any way to merge the data so that I can get the expected output, or is there some other way?

Thanks

2 Answers:

Answer 0 (score: 1)

You can find my solution below. I created four DataFrames for the SAME / UPDATE / INSERT / DELETE cases and then unioned them together.

>>> from functools import reduce
>>> from pyspark.sql import DataFrame
>>> import pyspark.sql.functions as F

>>> df1 = sc.parallelize([
...     (1,'IT','RAM',1000),    
...     (2,'IT','SRI',600),
...     (3,'HR','GOPI',1500),    
...     (5,'HW','MAHI',700)
...     ]).toDF(['NO','DEPT','NAME','SAL'])
>>> df1.show()
+---+----+----+----+
| NO|DEPT|NAME| SAL|
+---+----+----+----+
|  1|  IT| RAM|1000|
|  2|  IT| SRI| 600|
|  3|  HR|GOPI|1500|
|  5|  HW|MAHI| 700|
+---+----+----+----+

>>> df2 = sc.parallelize([
...     (1,'IT','RAM',1000),    
...     (2,'IT','SRI',900),
...     (4,'MT','SUMP',1200),    
...     (5,'HW','MAHI',700)
...     ]).toDF(['NO','DEPT','NAME','SAL'])
>>> df2.show()
+---+----+----+----+
| NO|DEPT|NAME| SAL|
+---+----+----+----+
|  1|  IT| RAM|1000|
|  2|  IT| SRI| 900|
|  4|  MT|SUMP|1200|
|  5|  HW|MAHI| 700|
+---+----+----+----+

#DELETE
>>> df_d = df1.join(df2, df1.NO == df2.NO, 'left').filter(F.isnull(df2.NO)).select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('D').alias('FLAG'))
#INSERT
>>> df_i = df1.join(df2, df1.NO == df2.NO, 'right').filter(F.isnull(df1.NO)).select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('I').alias('FLAG'))
#SAME
>>> df_s = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) == F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).\
...     select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('S').alias('FLAG'))
#UPDATE
>>> df_u = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) != F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).\
...     select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('U').alias('FLAG'))


>>> dfs = [df_s,df_u,df_d,df_i]
>>> df = reduce(DataFrame.unionAll, dfs)
>>> 
>>> df.show()
+---+----+----+----+----+                                                       
| NO|DEPT|NAME| SAL|FLAG|
+---+----+----+----+----+
|  5|  HW|MAHI| 700|   S|
|  1|  IT| RAM|1000|   S|
|  2|  IT| SRI| 900|   U|
|  3|  HR|GOPI|1500|   D|
|  4|  MT|SUMP|1200|   I|
+---+----+----+----+----+
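
The final `reduce(DataFrame.unionAll, dfs)` step folds the union pairwise over the list: `reduce(f, [a, b, c, d])` is `f(f(f(a, b), c), d)`. The same fold on plain Python lists (a sketch, not Spark code) makes the pattern clear:

```python
from functools import reduce

# DataFrame.unionAll stacks rows the same way list concatenation does here:
# the four per-flag pieces are folded into one combined result.
parts = [[('S', 5), ('S', 1)], [('U', 2)], [('D', 3)], [('I', 4)]]
combined = reduce(lambda a, b: a + b, parts)
print(combined)
# -> [('S', 5), ('S', 1), ('U', 2), ('D', 3), ('I', 4)]
```

Note that in Spark 2.0+ `DataFrame.union` is the preferred name; `unionAll` is kept as an alias.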

Answer 1 (score: 0)

You can use an 'outer' join after first concatenating all the columns, and then create a udf for the flags.

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

df = sql.createDataFrame([
     (1,'IT','RAM',1000),
     (2,'IT','SRI',600),
     (3,'HR','GOPI',1500),
     (5,'HW','MAHI',700)],
     ['NO'  ,'DEPT', 'NAME',   'SAL' ])

df1 = sql.createDataFrame([
     (1,'IT','RAM',1000),
     (2,'IT','SRI',900),
     (4,'MT','SUMP',1200 ),
     (5,'HW','MAHI',700)],
     ['NO'  ,'DEPT', 'NAME',   'SAL' ])

def flags(x,y):
    if not x:
        return y+'-I'
    if not y:
        return x+'-D'
    if x == y:
        return x+'-S'
    return y+'-U'

_cols = df.columns
flag_udf = F.udf(lambda x,y: flags(x,y),StringType())   


df = df.select(['NO']+ [F.concat_ws('-', *[F.col(_c) for _c in df.columns]).alias('f1')])\
        .join(df1.select(['NO']+ [F.concat_ws('-', *[F.col(_c1) for _c1 in df1.columns]).alias('f2')]), 'NO', 'outer')\
        .select(flag_udf('f1','f2').alias('combined'))
df.show()

The result will be

+----------------+                                                              
|        combined|
+----------------+
| 5-HW-MAHI-700-S|
| 1-IT-RAM-1000-S|
|3-HR-GOPI-1500-D|
|  2-IT-SRI-900-U|
|4-MT-SUMP-1200-I|
+----------------+

Finally, split the combined column.

split_col = F.split(df['combined'], '-')
df = df.select([split_col.getItem(i).alias(s) for i,s in enumerate(_cols+['FLAG'])])

df.show()

You will get the desired output,

+---+----+----+----+----+                                                       
| NO|DEPT|NAME| SAL|FLAG|
+---+----+----+----+----+
|  5|  HW|MAHI| 700|   S|
|  1|  IT| RAM|1000|   S|
|  3|  HR|GOPI|1500|   D|
|  2|  IT| SRI| 900|   U|
|  4|  MT|SUMP|1200|   I|
+---+----+----+----+----+
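
One caveat with this approach (my observation, not part of the answer): the `'-'` delimiter must never occur inside the data itself, otherwise the final split produces extra fields that no longer line up with the column names. The splitting step mirrors plain-Python `str.split`:

```python
cols = ['NO', 'DEPT', 'NAME', 'SAL', 'FLAG']
combined = '4-MT-SUMP-1200-I'
record = dict(zip(cols, combined.split('-')))
print(record)
# -> {'NO': '4', 'DEPT': 'MT', 'NAME': 'SUMP', 'SAL': '1200', 'FLAG': 'I'}

# A value such as 'SUMP-2' would shift every field after it, so pick a
# delimiter that cannot appear in the data, or keep the columns separate.
```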