I have two files, file A and file B. File A contains tab-separated columns, with rows like:
person Id person Name
56783 ram 12 > 4 matches intelligent
78954 rahim 45 >> 6 doesn't occur
56783 rahul 67 >> 6 will work for sure
78967 rajesh 78 >> 4 I dont know
File B contains tab-separated columns, with rows like:
person Id person Name city Name country Name salary
34526 paul 56 >> 78 has no idea Tel Aviv Isreal 60
56783 ram 12 > 4 matches intelligent Seattle USA 70
58783 ram 12 > 4 matches intelligent Seattle USA 90
39526 saul 96 >> 78 has no idea Delhi India 60
78954 rahim 45 >> 6 doesn't occur Mumbai India 90
98789 rahim 45 >> 6 doesn't occur Mumbai India 80
67526 delta 89 >> 78 has no idea Tel Aviv Isreal 50
56783 rahul 67 >> 6 will work for sure Boston USA 79
78783 rahul 67 >> 6 will work for sure Boston USA 79
39526 pallavi 56 >> 78 has no idea Hyderabad India 60
78967 rajesh 78 >> 4 I dont know Hyderabad India 78
08960 rajesh 78 >> 4 I dont know Hyderabad India 87
I want to keep in file B only the records that match file A, and remove the other duplicate records. For example, notice these two rows:
56783 ram 12 > 4 matches intelligent Seattle USA 70
58783 ram 12 > 4 matches intelligent Seattle USA 90
ram appears twice with different ids. I want to keep in file B only the row with the same id as in file A and remove any duplicates, i.e. I want to end up with only:
56783 ram 12 > 4 matches intelligent Seattle USA 70
Similarly, for rahim I want file B to contain only the person Id that appears in file A, and delete the other rahim ids that don't match between the two files. Another example:
78954 rahim 45 >> 6 doesn't occur Mumbai India 90
98789 rahim 45 >> 6 doesn't occur Mumbai India 80
I want only the person Id that file A has, so the only rahim record left in file B should be this one, and the other should be deleted:
78954 rahim 45 >> 6 doesn't occur Mumbai India 90
I can use any programming language; Java and Scala are what I'd prefer. In short, I don't want duplicate person names with differing ids: I want file B to keep the id from file A, delete the rest, and then be saved. I thought this would be easy in Spark; I tried, but with no luck!
The final output should be:
person Id person Name city Name country Name salary
34526 paul 56 >> 78 has no idea Tel Aviv Isreal 60
56783 ram 12 > 4 matches intelligent Seattle USA 70
39526 saul 96 >> 78 has no idea Delhi India 60
78954 rahim 45 >> 6 doesn't occur Mumbai India 90
67526 delta 89 >> 78 has no idea Tel Aviv Isreal 50
56783 rahul 67 >> 6 will work for sure Boston USA 79
39526 pallavi 56 >> 78 has no idea Hyderabad India 60
78967 rajesh 78 >> 4 I dont know Hyderabad India 78
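To make the requested rule concrete before any Spark code: a row of file B should survive when its (person Id, person Name) pair appears in file A, or when its person Name never appears in file A at all. A minimal plain-Java sketch of that rule (class and variable names are my own; a few sample rows are hardcoded in place of real file I/O):

```java
import java.util.*;

public class DedupByFileA {
    public static void main(String[] args) {
        // (person Id, person Name) pairs taken from file A
        List<String[]> fileA = List.of(
            new String[]{"56783", "ram 12 > 4 matches intelligent"},
            new String[]{"78954", "rahim 45 >> 6 doesn't occur"},
            new String[]{"56783", "rahul 67 >> 6 will work for sure"},
            new String[]{"78967", "rajesh 78 >> 4 I dont know"});

        Set<String> pairsInA = new HashSet<>();
        Set<String> namesInA = new HashSet<>();
        for (String[] row : fileA) {
            pairsInA.add(row[0] + "\t" + row[1]);
            namesInA.add(row[1]);
        }

        // a few sample rows of file B: id, name, city, country, salary
        List<String[]> fileB = List.of(
            new String[]{"56783", "ram 12 > 4 matches intelligent", "Seattle", "USA", "70"},
            new String[]{"58783", "ram 12 > 4 matches intelligent", "Seattle", "USA", "90"},
            new String[]{"34526", "paul 56 >> 78 has no idea", "Tel Aviv", "Isreal", "60"});

        for (String[] row : fileB) {
            boolean nameKnown = namesInA.contains(row[1]);
            boolean pairMatches = pairsInA.contains(row[0] + "\t" + row[1]);
            // drop the row only when file A knows the name but under a different id
            if (!nameKnown || pairMatches) {
                System.out.println(String.join("\t", row));
            }
        }
    }
}
```

Applied to every row of file B, this rule reproduces the final output listed above.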
Answer 0 (score: 0)

I would suggest you use the dataFrame API, which makes this task easy. I tried to read the files into dataframes with sqlContext, using \t as the delimiter, but unfortunately the tab delimiter didn't work for me. So I changed the separator in the files to | and read them into dataFrames with the following code:

var df1 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("header", "true")
  .load("location of file A")

You will have to find a way to read the tab-separated files A and B into dataFrames yourself, or change the delimiter as I did. Reading file A gives:
+---------+--------------------------------+
|person Id|person Name |
+---------+--------------------------------+
|56783 |ram 12 > 4 matches intelligent |
|78954 |rahim 45 >> 6 doesn't occur |
|56783 |rahul 67 >> 6 will work for sure|
|78967 |rajesh 78 >> 4 I dont know |
+---------+--------------------------------+
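As a side note (an assumption on my part, not part of the original answer's setup): in Spark 2.x and later, CSV support is built in and a tab separator can usually be set directly, so the | workaround may not be needed. A non-runnable fragment, assuming a SparkSession named spark already exists:

```
// Spark 2.x+ Java API sketch; "sep" sets the column separator
Dataset<Row> df1 = spark.read()
    .option("sep", "\t")
    .option("header", "true")
    .csv("location of file A");
```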
Apply the same process to file B to get dataframe df2:
+---------+--------------------------------+---------+------------+------+
|person Id|person Name |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|34526 |paul 56 >> 78 has no idea |Tel Aviv |Isreal |60 |
|56783 |ram 12 > 4 matches intelligent |Seattle |USA |70 |
|58783 |ram 12 > 4 matches intelligent |Seattle |USA |90 |
|39526 |saul 96 >> 78 has no idea |Delhi |India |60 |
|78954 |rahim 45 >> 6 doesn't occur |Mumbai |India |90 |
|98789 |rahim 45 >> 6 doesn't occur |Mumbai |India |80 |
|67526 |delta 89 >> 78 has no idea |Tel Aviv |Isreal |50 |
|56783 |rahul 67 >> 6 will work for sure|Boston |USA |79 |
|78783 |rahul 67 >> 6 will work for sure|Boston |USA |79 |
|39526 |pallavi 56 >> 78 has no idea |Hyderabad|India |60 |
|78967 |rajesh 78 >> 4 I dont know |Hyderabad|India |78 |
|08960 |rajesh 78 >> 4 I dont know |Hyderabad|India |87 |
+---------+--------------------------------+---------+------------+------+
Since the column names of the two dataFrames match, the columns of one dataframe need to be renamed so they can be dropped after the join:

df1 = df1.withColumnRenamed("person Id", "Id").withColumnRenamed("person Name", "Name")
Now the only steps left are to join the two dataframes, drop the unnecessary columns, and drop the duplicates:
df1.join(df2, df1("Id") === df2("person Id")).drop("Id", "Name").dropDuplicates("person Id", "person Name")
Final output:
+---------+--------------------------------+---------+------------+------+
|person Id|person Name |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|78954 |rahim 45 >> 6 doesn't occur |Mumbai |India |90 |
|78967 |rajesh 78 >> 4 I dont know |Hyderabad|India |78 |
|56783 |ram 12 > 4 matches intelligent |Seattle |USA |70 |
|56783 |rahul 67 >> 6 will work for sure|Boston |USA |79 |
+---------+--------------------------------+---------+------------+------+
That should address all the points of confusion in the question, as I understood it.

Edited:

The output above comes from the join as written (an inner join by default), so it doesn't match the required output. A right join and a change in the dropDuplicates columns should produce the required output:
df1.join(df2, df1("Id") === df2("person Id"), "right").drop("Id", "Name").dropDuplicates("person Name")
Final output:
+---------+--------------------------------+---------+------------+------+
|person Id|person Name |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|39526 |pallavi 56 >> 78 has no idea |Hyderabad|India |60 |
|34526 |paul 56 >> 78 has no idea |Tel Aviv |Isreal |60 |
|56783 |ram 12 > 4 matches intelligent |Seattle |USA |70 |
|78954 |rahim 45 >> 6 doesn't occur |Mumbai |India |90 |
|39526 |saul 96 >> 78 has no idea |Delhi |India |60 |
|56783 |rahul 67 >> 6 will work for sure|Boston |USA |79 |
|78967 |rajesh 78 >> 4 I dont know |Hyderabad|India |78 |
|67526 |delta 89 >> 78 has no idea |Tel Aviv |Isreal |50 |
+---------+--------------------------------+---------+------------+------+
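For readers without a Spark setup, the dropDuplicates step above can be cross-checked in plain Java: keep one file-B row per person Name, preferring a row whose id matches file A. (Spark's dropDuplicates keeps an arbitrary row per key; the sketch below makes that preference explicit. Class and variable names are my own, and a few sample rows stand in for real file I/O.)

```java
import java.util.*;

public class KeepOnePerName {
    public static void main(String[] args) {
        // file A's id for each person Name (assumed unique per name in file A)
        Map<String, String> idInA = new HashMap<>();
        idInA.put("ram 12 > 4 matches intelligent", "56783");
        idInA.put("rahim 45 >> 6 doesn't occur", "78954");

        // sample file B rows: id, name, salary
        List<String[]> fileB = List.of(
            new String[]{"98789", "rahim 45 >> 6 doesn't occur", "80"},
            new String[]{"78954", "rahim 45 >> 6 doesn't occur", "90"},
            new String[]{"34526", "paul 56 >> 78 has no idea", "60"});

        // keep one row per name: prefer the row whose id matches file A,
        // otherwise keep the first row seen for that name
        Map<String, String[]> kept = new LinkedHashMap<>();
        for (String[] row : fileB) {
            String name = row[1];
            boolean rowMatchesA = row[0].equals(idInA.get(name));
            String[] current = kept.get(name);
            boolean currentMatchesA =
                current != null && current[0].equals(idInA.get(name));
            if (current == null || (rowMatchesA && !currentMatchesA)) {
                kept.put(name, row);
            }
        }
        for (String[] row : kept.values()) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

With the sample rows above, the 98789 rahim row is replaced by the 78954 one (the id from file A), and the single paul row is kept unchanged.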