Question

I want to write a map-reduce algorithm that finds the top N values for each group, in either ascending (A) or descending (D) order.

Input data
a,1
a,9
b,3
b,5
a,4
a,7
b,1
c,1
c,9
c,-2
d,1
b,1
a,10
a,19

Output type 1 (ascending)
a 1,4,7,9,10,19
b 1,1,3,5
c -2,1,9
d 1

Output type 2 (descending)
a 19,10,9,7,4,1
b 5,3,1,1
c 9,1,-2
d 1

Output type 1 for the top three values

a 1,4,7
b 1,1,3
c -2,1,9
d 1

Please guide me.

Answer 1

You need to write a mapper that splits each input line on the comma and emits a (Text, IntWritable) pair:

Text('a,1') -> (mapper) -> Text('a'), IntWritable(1)
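
As a rough sketch of the parsing step only, the plain-Java helper below shows what the body of a map() method might do before emitting the pair. The class and method names (LineParser, parse) are hypothetical, and skipping malformed lines is an assumption, not something the question specifies. In a real Hadoop job the result would be wrapped in Text and IntWritable and passed to context.write.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper: mirrors the parsing a Hadoop map() body would do
// before emitting (Text key, IntWritable value).
class LineParser {
    /** Splits "a,1" into the pair ("a", 1); returns null for malformed lines. */
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",", 2);
        if (parts.length != 2) {
            return null; // no comma found: skip the record (assumed policy)
        }
        try {
            int value = Integer.parseInt(parts[1].trim());
            return new SimpleImmutableEntry<>(parts[0].trim(), value);
        } catch (NumberFormatException e) {
            return null; // value is not an integer: skip the record (assumed policy)
        }
    }
}
```

Trimming both fields means a line with trailing whitespace still parses.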

In the reducer you will receive the group key and the list of its values. Use a priority queue to select the top K values from that list:

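// requires: import java.util.PriorityQueue;
// Add all values to a min-heap, so poll() returns the smallest value first.
PriorityQueue<Integer> queue = new PriorityQueue<Integer>();
for (IntWritable value : values)
    queue.add(value.get());

// Take at most K elements; stop early if the group has fewer than K values
// (poll() returns null on an empty queue, which would otherwise print "null").
StringBuilder topK = new StringBuilder();
for (int i = 0; i < K && !queue.isEmpty(); ++i) {
    if (i > 0) topK.append(", ");
    topK.append(queue.poll());
}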
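
The snippet above always returns the smallest values and holds the whole group in the queue. Since the question asks for either ascending or descending order, here is one possible variant, written as plain Java without Hadoop types so it can be tried locally. The names TopNSelector and topN are hypothetical. It keeps a heap of at most n elements, which is a swapped-in technique (bounded heap) rather than what the answer itself does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopNSelector {
    /**
     * Returns up to n values: the n smallest in ascending order when
     * ascending is true, otherwise the n largest in descending order.
     */
    static List<Integer> topN(Iterable<Integer> values, int n, boolean ascending) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // The order in which the result should be listed.
        Comparator<Integer> order = ascending
                ? Comparator.<Integer>naturalOrder()
                : Comparator.<Integer>reverseOrder();
        // Heap ordered the opposite way: its head is the worst value kept
        // so far, so evicting the head keeps the best n values.
        PriorityQueue<Integer> heap = new PriorityQueue<>(order.reversed());
        for (int v : values) {
            heap.add(v);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(order);
        return result;
    }
}
```

For example, with the values 1, 9, 4, 7, 10, 19 and n = 3, topN returns [1, 4, 7] when ascending is true and [19, 10, 9] when it is false. A group with fewer than n values is returned whole.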

Answer 2

In Scalding (assuming the data is in a TSV file), it would look something like this:

Tsv(path, ('key, 'value)).groupBy('key)(_.sortWithTake('value -> 'value, N))
.write(Tsv(outputPath))
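
To check expected output on a small sample without a cluster, the same group-then-take idea can be approximated in memory. This is only a local sketch in plain Java, not a translation of the Scalding API. The names LocalTopN and topNPerGroup are hypothetical, and it assumes comma-separated lines as in the question, well-formed input, and n >= 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalTopN {
    /** Groups "key,value" lines by key and keeps the n smallest values per key. */
    static Map<String, List<Integer>> topNPerGroup(List<String> lines, int n) {
        // TreeMap keeps the group keys sorted, which makes output easy to compare.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            groups.computeIfAbsent(parts[0].trim(), k -> new ArrayList<>())
                  .add(Integer.parseInt(parts[1].trim()));
        }
        for (List<Integer> vals : groups.values()) {
            Collections.sort(vals);                   // ascending order
            if (vals.size() > n) {
                vals.subList(n, vals.size()).clear(); // drop everything past the first n
            }
        }
        return groups;
    }
}
```

Sorting each whole group is fine for a sanity check on small data. For large groups a bounded heap avoids holding and sorting every value.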

Top-K values per group using MapReduce
