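A file containing pipe-delimited lists of numbers can contain duplicates. I need to write a MapReduce program that lists the numbers without duplicates, in their original input order. I am able to remove the duplicates, but the input order is not preserved.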
Answer 0 (score: 1)
It's quite simple. Suppose your text is:
Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
Line 5 -> But his face you could not see,
Line 6 -> The Quangle Wangle sat,
where lines 2 and 3 are repeated as lines 6 and 5, respectively.
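The mapper should be similar to the one in the wordcount program, where the input to the mapper is key-value pairs of the form (byte offset of the line, line text):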
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
(113, But his face you could not see,)
(146, The Quangle Wangle sat,)
Output of the mapper (the offset is prepended to the line, separated by _):
(NullWritable, 0_On the top of the Crumpetty Tree)
(NullWritable, 33_The Quangle Wangle sat,)
(NullWritable, 57_But his face you could not see,)
(NullWritable, 89_On account of his Beaver Hat.)
(NullWritable, 113_But his face you could not see,)
(NullWritable, 146_The Quangle Wangle sat,)
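The tagging step above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration: the class OffsetTagger and its tag method are hypothetical names, and the offsets are computed by assuming UTF-8 text with single-byte '\n' line endings. In a real job the offset would already arrive as the mapper's input key from TextInputFormat, and the mapper would only need to emit the key, "_" and the line as one value under a NullWritable key.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Hadoop class): mimics the values the mapper emits.
class OffsetTagger {
    // Prefixes each line with its starting byte offset and an '_' separator.
    // Assumption: UTF-8 input where every line ends with a one-byte '\n'.
    static List<String> tag(List<String> lines) {
        List<String> tagged = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            tagged.add(offset + "_" + line);
            // Advance past this line's bytes plus its newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return tagged;
    }
}
```

Because the offset is part of the value, the original position of every line survives the shuffle even though all records share the same key.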
Now make sure you have only one reducer, so that this single reducer receives all of the mapper output.
Input to the reducer:
Key: NullWritable
Iterable<value>: [(0_On the top of the Crumpetty Tree),
(33_The Quangle Wangle sat,),
(57_But his face you could not see,),
(89_On account of his Beaver Hat.),
(113_But his face you could not see,),
(146_The Quangle Wangle sat,)]
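Note that this approach relies on the reducer seeing the values in ascending offset order, which here is the original order, because the offsets produced by TextInputFormat are always ascending within a file. Hadoop only guarantees sorting by key, not by value, so if the order matters it is safer to sort the values by their numeric offset inside the reducer.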
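In the reducer, simply iterate over the list, drop the duplicates, and write each remaining line after stripping the leading offset and the _ separator. The reducer output looks like: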
Reducer key-value:
NullWritable, value.split("_", 2)[1]   (limit 2, so underscores inside the line itself are kept)
Reducer output:
Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
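The reducer logic described above can be sketched in plain Java as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: OrderedDedup and its reduce method are hypothetical names, the values are assumed to have the form offset_line, and all values are assumed to fit in memory (reasonable only because a single reducer is used). It sorts by the numeric offset explicitly instead of trusting the arrival order, and keeps only the first occurrence of each line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a Hadoop class): mimics the body of the single reducer.
class OrderedDedup {
    static List<String> reduce(Iterable<String> values) {
        List<String[]> parts = new ArrayList<>();
        for (String value : values) {
            // Limit 2: split only at the first '_' so the line text stays intact.
            parts.add(value.split("_", 2));
        }
        // Restore the original input order by sorting on the numeric offset.
        parts.sort(Comparator.comparingLong(p -> Long.parseLong(p[0])));

        Set<String> seen = new HashSet<>();
        List<String> output = new ArrayList<>();
        for (String[] p : parts) {
            // Set.add returns false for a line that was already written.
            if (seen.add(p[1])) {
                output.add(p[1]);
            }
        }
        return output;
    }
}
```

Sorting on the parsed number matters: comparing the raw strings would place "113_..." before "33_...", which would break the original order.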