Question

I am trying to combine the results I get from two MapReduce jobs. The first job returns the 5 most influential papers. Below is the code for the first reducer.

import sys
import operator

current_word = None
current_count = 0
word = None
topFive = {}
# input comes from stdin
for line in sys.stdin:
    line = line.strip()

    # parse the input we got from mapper.py
    word, check = line.split('\t', 1)
    # split() never returns None, so every mapper record counts as one citation
    count = 1

    if current_word == word:
        current_count += count
    else:
        if current_word:
            topFive.update({current_word: current_count})
            #print(current_word, current_count)
        current_count = count
        current_word = word
# the loop never stores the last key, so add it to topFive here
if current_word:
    topFive[current_word] = current_count

# items() works on Python 2 and 3; take six candidates so a 'nan' key can be skipped
t = sorted(topFive.items(), key=lambda x: -x[1])[:6]
print("Top five most cited papers")
count = 1
for x in t:
    if x[0] != 'nan' and count <= 5:
        print("{0}: {1}".format(*x))
        count = count + 1

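The second job finds the 5 most influential authors, and its code is roughly the same as the code above. I want to take the results of these two jobs and join them so that, for each author, I can determine the average number of citations of their 3 most influential papers. I don't know how to do this. It seems I need to join the results somehow?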

Answer 1

At this point you will end up with two output directories, one for the authors and one for the papers.

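Now you want to perform a JOIN operation (in database terms) on the two files. The MapReduce way to do this is to run a third job that takes the two output files as its input.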

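JOIN operations in Hadoop have been studied in depth. One approach is the reduce-side join pattern. In this pattern the mapper emits a composite key made of two sub-keys: the original key plus a boolean flag that says whether the record comes from table 0 or table 1.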
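As a minimal sketch of the tagging step, assuming both inputs are tab-separated `key<TAB>value` lines that share a join key (for example a paper identifier), the mapper could look roughly like this. The names `tag_record` and `run_mapper`, and the idea of passing the table flag in as a parameter, are illustrative assumptions and not part of the original jobs.

```python
def tag_record(line, table_flag):
    """Turn 'key<TAB>value' into 'key<TAB>flag<TAB>value'.

    table_flag is "0" or "1" and marks which input the record came from
    (hypothetical convention: "0" for one table, "1" for the other).
    """
    key, value = line.rstrip("\n").split("\t", 1)
    return "{0}\t{1}\t{2}".format(key, table_flag, value)


def run_mapper(lines, table_flag):
    """Tag every non-empty input line with the flag of its source table."""
    for line in lines:
        if line.strip():
            yield tag_record(line, table_flag)


# Small in-memory demonstration with made-up records.
for out in run_mapper(["p1\talice\n", "p2\tbob\n"], "0"):
    print(out)
```

In a streaming job the same function would be fed `sys.stdin` instead of a list, with the flag chosen per input file.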

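Before the records reach the reducer, you need to write a partitioner that takes these composite keys apart and partitions on the original key only. Each reducer then receives all records with the same key from both tables.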
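The partitioning and the join in the reducer can be sketched in plain Python as follows. This is only an illustration under stated assumptions: records look like `key<TAB>flag<TAB>value`, flag `"0"` carries an author and flag `"1"` carries a citation count, and `partition`, `join_reducer` and `top3_average` are hypothetical helper names. On a real cluster the partitioning is done by Hadoop itself; the `partition` function here only imitates the idea of hashing the original key and ignoring the flag.

```python
import zlib
from collections import defaultdict
from itertools import groupby


def partition(composite_key, num_reducers):
    """Pick a reducer from the original key only, ignoring the table flag."""
    natural_key = composite_key.split("\t", 1)[0]
    checksum = zlib.crc32(natural_key.encode("utf-8")) & 0xFFFFFFFF
    return checksum % num_reducers


def join_reducer(sorted_lines):
    """Join tagged records that arrive sorted by key, then by flag."""
    parsed = (line.rstrip("\n").split("\t", 2) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda rec: rec[0]):
        left, right = [], []
        for _, flag, value in group:
            (left if flag == "0" else right).append(value)
        # Keys present in only one table produce no output (inner join).
        for l_value in left:
            for r_value in right:
                yield key, l_value, r_value


def top3_average(joined_rows):
    """Average the three largest citation counts per author."""
    per_author = defaultdict(list)
    for _paper, author, citations in joined_rows:
        per_author[author].append(int(citations))
    result = {}
    for author, counts in per_author.items():
        top = sorted(counts, reverse=True)[:3]
        result[author] = sum(top) / float(len(top))
    return result


# Made-up records; sorted() stands in for Hadoop's shuffle and sort.
tagged = ["p1\t1\t12", "p2\t0\tbob", "p1\t0\talice", "p2\t1\t7", "p3\t0\tcarol"]
joined = list(join_reducer(sorted(tagged)))
print(joined)
print(top3_average(joined))
```

Collecting both sides in lists keeps the sketch short. Because flag-0 records arrive first within each key, a reducer for large inputs could buffer only those and stream the rest.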

Let me know if you need further clarification. I wrote this quickly.

Combining the results of two MapReduce jobs