How can I use a MapReduce program to find the 20 most-visited destinations in a dataset?

Date: 2018-06-04 02:33:38

Tags: java mapreduce

Travel industry dataset description:

Column 1: City pair (Combination of from and to): String  
Column 2: From location: String  
Column 3: To Location: String  
Column 4: Product type: Integer (1=Air, 2=Car, 3=Air + Car, 4=Hotel, 5=Air + Hotel, 6=Hotel + Car, 7=Air + Hotel + Car)  
Column 5: Adults Traveling: Integer  
Column 6: Seniors traveling: Integer  
Column 7: Children traveling: Integer  
Column 8: Youth traveling: Integer  
Column 9: Infant traveling: Integer  
Column 10: Air booking price: Float  
Column 11: Car booking price: Float  
Column 12: Hotel booking price: Float  
Column 13: Airline code: String  
Column 14: Airline name: String  
Column 15: Car vendor code: String  
Column 16: Hotel name: String  

1 Answer:

Answer 0: (score: 0)

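One option is to run two MR jobs:
job1 map: emit ["To", 1] for each record
job1 reduce: sum the counts for each destination and output ["To", count]
job2 map: read the previous job's output (["To", count]) and emit it as [count, "To"]
job2 reduce (number of reducers = 1): sort by count and output the 20 rows with the highest counts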
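
The two jobs above can be sketched in plain Java without any Hadoop dependencies, just to make the data flow concrete. This is a minimal simulation, not a Hadoop job: the class name `TopDestinations`, the method names, the comma delimiter, and the sample rows are all assumptions for illustration (the question does not say how the file is delimited), and ties are broken alphabetically as an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hadoop-free simulation of the two-job pipeline (hypothetical names throughout).
class TopDestinations {

    // Assumption: "To Location" is column 3 of the dataset, i.e. index 2.
    static final int TO_COLUMN = 2;

    // Job 1: the map step emits (to, 1) per record; the reduce step sums per destination.
    static Map<String, Long> countDestinations(List<String> lines, String delimiter) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(Pattern.quote(delimiter), -1);
            if (fields.length <= TO_COLUMN) {
                continue; // skip malformed records
            }
            String to = fields[TO_COLUMN].trim();
            if (!to.isEmpty()) {
                counts.merge(to, 1L, Long::sum); // equivalent of summing the emitted 1s
            }
        }
        return counts;
    }

    // Job 2: a single "reducer" sees every (destination, count) pair,
    // sorts by count descending, and keeps the first n.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            // Tie-break by destination name (an assumption, to keep output deterministic).
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up sample rows containing only the first four columns.
        List<String> sample = Arrays.asList(
                "NYC-LAX,NYC,LAX,1",
                "SFO-LAX,SFO,LAX,2",
                "NYC-SFO,NYC,SFO,1");
        for (Map.Entry<String, Long> e : topN(countDestinations(sample, ","), 20)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a real Hadoop job, `countDestinations` would be split into a mapper and a summing reducer, and `topN` would run in the single reducer of the second job; the sorting there only works because one reducer receives all the pairs.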

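For query tasks like this, it is better to use a SQL-like query engine such as Apache Hive. A query that groups by the destination column, orders by count in descending order, and limits the result to 20 rows expresses the same logic, and Hive will translate it into the 2 Map-Reduce jobs described above.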