Remove duplicate records from file B using the records in file A

Asked: 2017-05-20 07:18:15

Tags: apache-spark apache-spark-sql

I have two files, file A and file B. File A contains tab-separated columns with the following rows:

 
person Id    person Name
56783    ram 12 > 4 matches intelligent
78954    rahim 45 >> 6 doesn't occur
56783    rahul 67 >> 6 will work for sure
78967    rajesh 78 >> 4 I dont know

File B contains tab-separated columns with the following rows:

person Id    person Name    city Name    country Name    salary
34526    paul 56 >> 78 has no idea    Tel Aviv    Isreal    60
56783    ram 12 > 4 matches intelligent    Seattle    USA    70
58783    ram 12 > 4 matches intelligent    Seattle    USA    90
39526    saul 96 >> 78 has no idea    Delhi    India    60
78954    rahim 45 >> 6 doesn't occur    Mumbai    India    90
98789    rahim 45 >> 6 doesn't occur    Mumbai    India    80
67526    delta 89 >> 78 has no idea    Tel Aviv    Isreal    50
56783    rahul 67 >> 6 will work for sure    Boston    USA    79
78783    rahul 67 >> 6 will work for sure    Boston    USA    79
39526    pallavi 56 >> 78 has no idea    Hyderabad    India    60
78967    rajesh 78 >> 4 I dont know    Hyderabad    India    78
08960    rajesh 78 >> 4 I dont know    Hyderabad    India    87

I want to keep in file B only the records whose person Id appears in file A, and remove the other duplicate records.

For example, if you notice:

56783    ram 12 > 4 matches intelligent    Seattle    USA    70
58783    ram 12 > 4 matches intelligent    Seattle    USA    90

ram appears twice with different ids. I want to keep in file B only the row with the same id as in file A and remove any duplicates:

For example, I only want to have:

56783    ram 12 > 4 matches intelligent    Seattle    USA    70

Similarly, I want to keep in file B only the rahim row whose person Id matches file A, and remove the other rahim ids that don't match across the two files.

Another example:

78954    rahim 45 >> 6 doesn't occur    Mumbai    India    90
98789    rahim 45 >> 6 doesn't occur    Mumbai    India    80

I only want the row in file B with the same person Id as in file A, so the only rahim record kept in file B is the one below; the others are removed:

78954    rahim 45 >> 6 doesn't occur    Mumbai    India    90

I can use any programming language, though I prefer Java or Scala. In the end, I don't want duplicate person names with different ids, so I want to keep in file B the ids that appear in file A, remove the rest, and then save it.

I thought this would be easy in Spark; I tried, but had no luck!!

The final output should be:

person Id    person Name    city Name    country Name    salary
34526    paul 56 >> 78 has no idea    Tel Aviv    Isreal    60
56783    ram 12 > 4 matches intelligent    Seattle    USA    70
39526    saul 96 >> 78 has no idea    Delhi    India    60
78954    rahim 45 >> 6 doesn't occur    Mumbai    India    90
67526    delta 89 >> 78 has no idea    Tel Aviv    Isreal    50
56783    rahul 67 >> 6 will work for sure    Boston    USA    79
39526    pallavi 56 >> 78 has no idea    Hyderabad    India    60
78967    rajesh 78 >> 4 I dont know    Hyderabad    India    78

1 Answer:

Answer 0 (score: 0)

I would suggest you use the DataFrame API, which makes this task easy. Read the files into DataFrames using sqlContext with tab (\t) as the delimiter.

Unfortunately the \t delimiter didn't work for me, so I changed the delimiter in the files to | and read them into DataFrames with the following code:

var df1 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("header", "true")
  .load("location of file A")

You will have to find a way to read the tab-separated files A and B into DataFrames, or change the delimiter as I did. The resulting df1 is:

+---------+--------------------------------+
|person Id|person Name                     |
+---------+--------------------------------+
|56783    |ram 12 > 4 matches intelligent  |
|78954    |rahim 45 >> 6 doesn't occur     |
|56783    |rahul 67 >> 6 will work for sure|
|78967    |rajesh 78 >> 4 I dont know      |
+---------+--------------------------------+

A similar process applied to file B gives DataFrame df2:

+---------+--------------------------------+---------+------------+------+
|person Id|person Name                     |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|34526    |paul 56 >> 78 has no idea       |Tel Aviv |Isreal      |60    |
|56783    |ram 12 > 4 matches intelligent  |Seattle  |USA         |70    |
|58783    |ram 12 > 4 matches intelligent  |Seattle  |USA         |90    |
|39526    |saul 96 >> 78 has no idea       |Delhi    |India       |60    |
|78954    |rahim 45 >> 6 doesn't occur     |Mumbai   |India       |90    |
|98789    |rahim 45 >> 6 doesn't occur     |Mumbai   |India       |80    |
|67526    |delta 89 >> 78 has no idea      |Tel Aviv |Isreal      |50    |
|56783    |rahul 67 >> 6 will work for sure|Boston   |USA         |79    |
|78783    |rahul 67 >> 6 will work for sure|Boston   |USA         |79    |
|39526    |pallavi 56 >> 78 has no idea    |Hyderabad|India       |60    |
|78967    |rajesh 78 >> 4 I dont know      |Hyderabad|India       |78    |
|08960    |rajesh 78 >> 4 I dont know      |Hyderabad|India       |87    |
+---------+--------------------------------+---------+------------+------+
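Side note: newer Spark releases can usually read tab-separated files directly, so changing the delimiter may not be necessary. This is only a minimal sketch, assuming Spark 2.x with a SparkSession; the app name and file paths are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dedupe-fileB")        // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// header=true takes the first row as column names; the delimiter is a literal tab
val df1 = spark.read
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("/path/to/fileA.tsv")      // hypothetical path to file A

val df2 = spark.read
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("/path/to/fileB.tsv")      // hypothetical path to file B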

Since the column names of the two DataFrames are the same, the columns of one DataFrame need to be renamed so they can be dropped after the join:

df1 = df1.withColumnRenamed("person Id", "Id").withColumnRenamed("person Name", "Name")

Now the only step left is to join the two DataFrames, drop the unnecessary columns, and drop the duplicates:

df1.join(df2, df1("Id") === df2("person Id")).drop("Id", "Name").dropDuplicates("person Id", "person Name")

Final output:

+---------+--------------------------------+---------+------------+------+
|person Id|person Name                     |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|78954    |rahim 45 >> 6 doesn't occur     |Mumbai   |India       |90    |
|78967    |rajesh 78 >> 4 I dont know      |Hyderabad|India       |78    |
|56783    |ram 12 > 4 matches intelligent  |Seattle  |USA         |70    |
|56783    |rahul 67 >> 6 will work for sure|Boston   |USA         |79    |
+---------+--------------------------------+---------+------------+------+
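As an aside, the rename-and-drop step can be avoided by joining on the shared column names. This is only a sketch, assuming both DataFrames keep their original column names (i.e. the rename above is skipped):

// inner join on both key columns; Spark keeps a single copy of the join columns
val deduped = df1.join(df2, Seq("person Id", "person Name"))
  .dropDuplicates("person Id", "person Name")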

As far as I understand, this addresses all the points of confusion in the question.

Edited:
The join above is an inner join, so the final output does not match the required output.

A simple change to a right join and to dropDuplicates should give the required output:

df1.join(df2, df1("Id") === df2("person Id"), "right").drop("Id", "Name").dropDuplicates("person Name")

Final output:

+---------+--------------------------------+---------+------------+------+
|person Id|person Name                     |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|39526    |pallavi 56 >> 78 has no idea    |Hyderabad|India       |60    |
|34526    |paul 56 >> 78 has no idea       |Tel Aviv |Isreal      |60    |
|56783    |ram 12 > 4 matches intelligent  |Seattle  |USA         |70    |
|78954    |rahim 45 >> 6 doesn't occur     |Mumbai   |India       |90    |
|39526    |saul 96 >> 78 has no idea       |Delhi    |India       |60    |
|56783    |rahul 67 >> 6 will work for sure|Boston   |USA         |79    |
|78967    |rajesh 78 >> 4 I dont know      |Hyderabad|India       |78    |
|67526    |delta 89 >> 78 has no idea      |Tel Aviv |Isreal      |50    |
+---------+--------------------------------+---------+------------+------+
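Finally, the question also asks to save the deduplicated file B. A minimal sketch of writing the result out, assuming the Spark 2.x CSV writer and a hypothetical output path (with the older spark-csv package you would use .format("com.databricks.spark.csv").save(...) instead):

val result = df1.join(df2, df1("Id") === df2("person Id"), "right")
  .drop("Id", "Name")
  .dropDuplicates("person Name")

// coalesce(1) just forces a single output part file; drop it for large data
result.coalesce(1)
  .write
  .option("delimiter", "|")
  .option("header", "true")
  .csv("/path/to/fileB_deduped")   // hypothetical output directory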