I have two files, file A and file B. File A contains tab-separated columns, with rows like:
person Id person Name
56783 ram 12 > 4 matches intelligent
78954 rahim 45 >> 6 doesn't occur
56783 rahul 67 >> 6 will work for sure
78967 rajesh 78 >> 4 I dont know
File B contains tab-separated columns, with rows like:
person Id person Name city Name country Name salary
34526 paul 56 >> 78 has no idea Tel Aviv Isreal 60
56783 ram 12 > 4 matches intelligent Seattle USA 70
58783 ram 12 > 4 matches intelligent Seattle USA 90
39526 saul 96 >> 78 has no idea Delhi India 60
78954 rahim 45 >> 6 doesn't occur Mumbai India 90
98789 rahim 45 >> 6 doesn't occur Mumbai India 80
67526 delta 89 >> 78 has no idea Tel Aviv Isreal 50
56783 rahul 67 >> 6 will work for sure Boston USA 79
78783 rahul 67 >> 6 will work for sure Boston USA 79
39526 pallavi 56 >> 78 has no idea Hyderabad India 60
78967 rajesh 78 >> 4 I dont know Hyderabad India 78
08960 rajesh 78 >> 4 I dont know Hyderabad India 87
I want to keep in file B only the records that match file A, and remove the other duplicate records. For example, notice these two rows:
56783 ram 12 > 4 matches intelligent Seattle USA 70
58783 ram 12 > 4 matches intelligent Seattle USA 90
ram appears twice with different ids. I want to keep in file B only the row with the same id as in file A and remove any duplicates, i.e. I want to end up with only:
56783 ram 12 > 4 matches intelligent Seattle USA 70
Similarly, for rahim I want file B to contain only the person Id that appears in file A, and delete the other rahim ids that don't match between the two files. Another example:
78954 rahim 45 >> 6 doesn't occur Mumbai India 90
98789 rahim 45 >> 6 doesn't occur Mumbai India 80
I want only the person Id that file A has, so the only rahim record left in file B should be this one, and the other should be deleted:
78954 rahim 45 >> 6 doesn't occur Mumbai India 90
I can use any programming language; Java and Scala are what I'd prefer. In short, I don't want duplicate person names with differing ids: I want file B to keep the id from file A, delete the rest, and then be saved. I thought this would be easy in Spark; I tried, but with no luck!
The final output should be:
person Id person Name city Name country Name salary
34526 paul 56 >> 78 has no idea Tel Aviv Isreal 60
56783 ram 12 > 4 matches intelligent Seattle USA 70
39526 saul 96 >> 78 has no idea Delhi India 60
78954 rahim 45 >> 6 doesn't occur Mumbai India 90
67526 delta 89 >> 78 has no idea Tel Aviv Isreal 50
56783 rahul 67 >> 6 will work for sure Boston USA 79
39526 pallavi 56 >> 78 has no idea Hyderabad India 60
78967 rajesh 78 >> 4 I dont know Hyderabad India 78
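To make the requested rule concrete before any Spark code: a row of file B should survive when its (person Id, person Name) pair appears in file A, or when its person Name never appears in file A at all. A minimal plain-Java sketch of that rule (class and variable names are my own; a few sample rows are hardcoded in place of real file I/O):

```java
import java.util.*;

public class DedupByFileA {
    public static void main(String[] args) {
        // (person Id, person Name) pairs taken from file A
        List<String[]> fileA = List.of(
            new String[]{"56783", "ram 12 > 4 matches intelligent"},
            new String[]{"78954", "rahim 45 >> 6 doesn't occur"},
            new String[]{"56783", "rahul 67 >> 6 will work for sure"},
            new String[]{"78967", "rajesh 78 >> 4 I dont know"});

        Set<String> pairsInA = new HashSet<>();
        Set<String> namesInA = new HashSet<>();
        for (String[] row : fileA) {
            pairsInA.add(row[0] + "\t" + row[1]);
            namesInA.add(row[1]);
        }

        // a few sample rows of file B: id, name, city, country, salary
        List<String[]> fileB = List.of(
            new String[]{"56783", "ram 12 > 4 matches intelligent", "Seattle", "USA", "70"},
            new String[]{"58783", "ram 12 > 4 matches intelligent", "Seattle", "USA", "90"},
            new String[]{"34526", "paul 56 >> 78 has no idea", "Tel Aviv", "Isreal", "60"});

        for (String[] row : fileB) {
            boolean nameKnown = namesInA.contains(row[1]);
            boolean pairMatches = pairsInA.contains(row[0] + "\t" + row[1]);
            // drop the row only when file A knows the name but under a different id
            if (!nameKnown || pairMatches) {
                System.out.println(String.join("\t", row));
            }
        }
    }
}
```

Applied to every row of file B, this rule reproduces the final output listed above.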
Answer 0 (score: 0)

I would suggest you use the dataFrame API, which makes this task easy. I tried to read the files into dataframes with sqlContext, using \t as the delimiter, but unfortunately the tab delimiter didn't work for me. So I changed the separator in the files to | and read them into dataFrames with the following code:

var df1 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("header", "true")
  .load("location of file A")

You will have to find a way to read the tab-separated files A and B into dataFrames yourself, or change the delimiter as I did. Reading file A gives:
+---------+--------------------------------+
|person Id|person Name |
+---------+--------------------------------+
|56783 |ram 12 > 4 matches intelligent |
|78954 |rahim 45 >> 6 doesn't occur |
|56783 |rahul 67 >> 6 will work for sure|
|78967 |rajesh 78 >> 4 I dont know |
+---------+--------------------------------+
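As a side note (an assumption on my part, not part of the original answer's setup): in Spark 2.x and later, CSV support is built in and a tab separator can usually be set directly, so the | workaround may not be needed. A non-runnable fragment, assuming a SparkSession named spark already exists:

```
// Spark 2.x+ Java API sketch; "sep" sets the column separator
Dataset<Row> df1 = spark.read()
    .option("sep", "\t")
    .option("header", "true")
    .csv("location of file A");
```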
Apply the same process to file B to get dataframe df2:
+---------+--------------------------------+---------+------------+------+
|person Id|person Name |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|34526 |paul 56 >> 78 has no idea |Tel Aviv |Isreal |60 |
|56783 |ram 12 > 4 matches intelligent |Seattle |USA |70 |
|58783 |ram 12 > 4 matches intelligent |Seattle |USA |90 |
|39526 |saul 96 >> 78 has no idea |Delhi |India |60 |
|78954 |rahim 45 >> 6 doesn't occur |Mumbai |India |90 |
|98789 |rahim 45 >> 6 doesn't occur |Mumbai |India |80 |
|67526 |delta 89 >> 78 has no idea |Tel Aviv |Isreal |50 |
|56783 |rahul 67 >> 6 will work for sure|Boston |USA |79 |
|78783 |rahul 67 >> 6 will work for sure|Boston |USA |79 |
|39526 |pallavi 56 >> 78 has no idea |Hyderabad|India |60 |
|78967 |rajesh 78 >> 4 I dont know |Hyderabad|India |78 |
|08960 |rajesh 78 >> 4 I dont know |Hyderabad|India |87 |
+---------+--------------------------------+---------+------------+------+
Since the column names of the two dataFrames match, the columns of one dataframe need to be renamed so they can be dropped after the join:

df1 = df1.withColumnRenamed("person Id", "Id").withColumnRenamed("person Name", "Name")
Now the only steps left are to join the two dataframes, drop the unnecessary columns, and drop the duplicates:
df1.join(df2, df1("Id") === df2("person Id")).drop("Id", "Name").dropDuplicates("person Id", "person Name")
Final output:
+---------+--------------------------------+---------+------------+------+
|person Id|person Name |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|78954 |rahim 45 >> 6 doesn't occur |Mumbai |India |90 |
|78967 |rajesh 78 >> 4 I dont know |Hyderabad|India |78 |
|56783 |ram 12 > 4 matches intelligent |Seattle |USA |70 |
|56783 |rahul 67 >> 6 will work for sure|Boston |USA |79 |
+---------+--------------------------------+---------+------------+------+
That should address all the points of confusion in the question, as I understood it.

Edited:

The output above comes from the join as written (an inner join by default), so it doesn't match the required output. A right join and a change in the dropDuplicates columns should produce the required output:
df1.join(df2, df1("Id") === df2("person Id"), "right").drop("Id", "Name").dropDuplicates("person Name")
Final output:
+---------+--------------------------------+---------+------------+------+
|person Id|person Name |city Name|country Name|salary|
+---------+--------------------------------+---------+------------+------+
|39526 |pallavi 56 >> 78 has no idea |Hyderabad|India |60 |
|34526 |paul 56 >> 78 has no idea |Tel Aviv |Isreal |60 |
|56783 |ram 12 > 4 matches intelligent |Seattle |USA |70 |
|78954 |rahim 45 >> 6 doesn't occur |Mumbai |India |90 |
|39526 |saul 96 >> 78 has no idea |Delhi |India |60 |
|56783 |rahul 67 >> 6 will work for sure|Boston |USA |79 |
|78967 |rajesh 78 >> 4 I dont know |Hyderabad|India |78 |
|67526 |delta 89 >> 78 has no idea |Tel Aviv |Isreal |50 |
+---------+--------------------------------+---------+------------+------+
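For readers without a Spark setup, the dropDuplicates step above can be cross-checked in plain Java: keep one file-B row per person Name, preferring a row whose id matches file A. (Spark's dropDuplicates keeps an arbitrary row per key; the sketch below makes that preference explicit. Class and variable names are my own, and a few sample rows stand in for real file I/O.)

```java
import java.util.*;

public class KeepOnePerName {
    public static void main(String[] args) {
        // file A's id for each person Name (assumed unique per name in file A)
        Map<String, String> idInA = new HashMap<>();
        idInA.put("ram 12 > 4 matches intelligent", "56783");
        idInA.put("rahim 45 >> 6 doesn't occur", "78954");

        // sample file B rows: id, name, salary
        List<String[]> fileB = List.of(
            new String[]{"98789", "rahim 45 >> 6 doesn't occur", "80"},
            new String[]{"78954", "rahim 45 >> 6 doesn't occur", "90"},
            new String[]{"34526", "paul 56 >> 78 has no idea", "60"});

        // keep one row per name: prefer the row whose id matches file A,
        // otherwise keep the first row seen for that name
        Map<String, String[]> kept = new LinkedHashMap<>();
        for (String[] row : fileB) {
            String name = row[1];
            boolean rowMatchesA = row[0].equals(idInA.get(name));
            String[] current = kept.get(name);
            boolean currentMatchesA =
                current != null && current[0].equals(idInA.get(name));
            if (current == null || (rowMatchesA && !currentMatchesA)) {
                kept.put(name, row);
            }
        }
        for (String[] row : kept.values()) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

With the sample rows above, the 98789 rahim row is replaced by the 78954 one (the id from file A), and the single paul row is kept unchanged.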