
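I have two lists of people from different events. I want to find the person names that appear in both lists, as well as the companies that appear in both. I realize that two people with the same name on different lists may not be the same person, but finding the matches would still help.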

First list example:
Name, Company, Title
John Doe, ACME Company, Elephant Trainer
Jane Smith, ACME Company, CEO
John Smith, Widgets-R-Us, Janitor
+ 10,000 rows

Second list example:
Name, Company
Fred Smith, ACME Company
John Smith, Widgets-R-Us
John Smith, XYZ Company
Jane Smith, XYZ Company
+ 10,000 rows

Desired output
Matching names:
John Smith
Jane Smith

Matching companies:
ACME Company
Widgets-R-Us

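I am running this in an AWS environment and am new to Hadoop. Any programming language is fine. I know how to do this in Excel, but I want to be able to scale over time to more name lists (each in its own CSV file).

Thanks!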

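You need a Mapper implementation in which you emit the name (or the company name) as a Text key with an IntWritable value of 1:

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Some logic to derive the person name or the company name.
        // Text has no split(), so convert to String first; column 0 is the name.
        String name = value.toString().split(",")[0].trim();
        // Emit the extracted field, not the whole input line.
        context.write(new Text(name), new IntWritable(1));
    }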

The reduce method in the Reducer would look something like this:

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Start at 0 so the total equals the number of emitted records.
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        // You get every unique name with the number of times it is repeated.
        context.write(key, new IntWritable(count));
    }
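One caveat with a plain count: it cannot tell a name repeated inside one list (John Smith appears twice in the second sample list) from a name present in both lists. A possible variation is to emit the source list as the value instead of 1 and keep only the keys seen under more than one source. The sketch below shows that idea in plain Java with no Hadoop dependency, so the map and reduce steps are only simulated. The class name ListMatcher, the methods collect and matches, and the "list1"/"list2" tags are hypothetical names chosen for illustration. It assumes exact, case-sensitive matching, simple comma-separated rows without quoted commas, and no header row in the input.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ListMatcher {

    // Simulated "map" step: for every row, record (field -> source list id).
    public static void collect(List<String> rows, int column, String listId,
                               Map<String, Set<String>> seen) {
        for (String row : rows) {
            String[] fields = row.split(",");
            if (fields.length <= column) {
                continue; // skip rows that lack the requested column
            }
            String key = fields[column].trim();
            if (key.isEmpty()) {
                continue;
            }
            seen.computeIfAbsent(key, k -> new HashSet<>()).add(listId);
        }
    }

    // Simulated "reduce" step: keep keys that were seen in more than one list.
    public static Set<String> matches(List<String> first, List<String> second, int column) {
        Map<String, Set<String>> seen = new TreeMap<>();
        collect(first, column, "list1", seen);   // hypothetical source tags
        collect(second, column, "list2", seen);
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : seen.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList(
                "John Doe, ACME Company, Elephant Trainer",
                "Jane Smith, ACME Company, CEO",
                "John Smith, Widgets-R-Us, Janitor");
        List<String> second = Arrays.asList(
                "Fred Smith, ACME Company",
                "John Smith, Widgets-R-Us",
                "John Smith, XYZ Company",
                "Jane Smith, XYZ Company");

        // Column 0 is the name, column 1 is the company.
        System.out.println("Matching names: " + matches(first, second, 0));
        System.out.println("Matching companies: " + matches(first, second, 1));
    }
}
```

In a real Hadoop job the source tag could be derived from the input file name in the mapper, and the same keep-if-seen-in-more-than-one-source test would run in the reducer. Adding more CSV files then only means adding more tags.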
Hope this helps.

Hadoop - find matching names in two lists of customers
