Question

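I compared the time performance of the following two reducers for word count. The reducers differ in whether they take advantage of the input being sorted by key.

Reducer 1 (does not use the sorted input):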

#!/usr/bin/python
import sys

# maps words to their counts
word2count = {}

for line in sys.stdin:
    w = line.strip().split()[0] # this is the word
    word2count[w] = (word2count[w] + 1 if word2count.has_key(w) 
                     else 1)

# Write (unsorted) tuples to stdout
for word in word2count.keys():
    print '%s\t%s' % (word, word2count[word])

Reducer 2 (takes advantage of the sorted input):

#!/usr/bin/python
import sys

# maps words to their counts
word2count = {}
last = ""
count = 0

for line in sys.stdin:
    w = line.strip().split()[0] # this is the word
    if w != last:
        # a new key starts: store the finished key's total (if any)
        if count != 0:
            word2count[last] = count
        last = w
        count = 1
    else: count += 1
# store the total of the final key
if count != 0: word2count[last] = count

# Write (unsorted) tuples to stdout
for word in word2count.keys():
    print '%s\t%s' % (word, word2count[word])

Both reducers use the same mapper:

#!/usr/bin/python
import sys
import string

#--- get all lines from stdin ---
for line in sys.stdin:
    #--- to lower case and remove punctuation ---
    line = line.lower().translate(None, string.punctuation)

    #--- split the line into words ---
    words = line.split()

    #--- output tuples [word, 1] in tab-delimited format---
    for word in words: 
        print '%s\t%s' % (word, "1")

I used the English translation of "War and Peace" as input. The difference in the reducers' time performance (CPU time) was about 20%.

Here is the command line I used to measure the time:

./mapper.py < war_and_peace.txt | sort | time ./reducer.py > /dev/null

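Considering that the first reducer is much simpler, and that sorting the reducer's input takes time (which possibly eats up that 20% difference), my question is: why does Hadoop sort the input to the reducer? Are there problems where having the reducer's input sorted matters more than it does for word count? (Please note: I realize that the output of each mapper needs to be sorted in order to balance the load across reducers. My question is about the motivation for merging the key-value pairs coming from different mappers instead of simply appending them.)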
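To make the distinction in the note concrete, here is a minimal sketch of merging versus appending. It is written for Python 3 (unlike the Python 2 scripts above), the two mapper outputs are made-up sample data, and `keys_are_contiguous` is a hypothetical helper introduced only for this illustration. It uses the standard library's `heapq.merge` as a stand-in for a merge of sorted runs; it is not meant to describe how Hadoop implements its shuffle.

```python
import heapq

# Made-up sorted outputs of two mappers, as (word, count) pairs.
mapper_a = [("and", 1), ("peace", 1), ("war", 1)]
mapper_b = [("and", 1), ("war", 1), ("war", 1)]

# Appending keeps each run sorted, but the combined stream is not sorted,
# so occurrences of the same key end up scattered.
appended = mapper_a + mapper_b

# Merging the sorted runs produces one globally sorted stream.
merged = list(heapq.merge(mapper_a, mapper_b))


def keys_are_contiguous(pairs):
    """Return True if every key occupies a single unbroken run."""
    seen = set()
    previous = None
    for key, _ in pairs:
        if key != previous:
            if key in seen:
                # The key already finished an earlier run and shows up again.
                return False
            seen.add(key)
            previous = key
    return True


print(keys_are_contiguous(appended))
print(keys_are_contiguous(merged))
```

With the appended stream, a reducer cannot tell when it has seen the last occurrence of a key until the whole input has been read; with the merged stream, a change of key signals that the previous key is complete.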

Answer 1

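Here is what I believe to be the correct answer (unless the person who marked this question as a duplicate can point me to this answer in the posts they found, shame on them). The question overlooks the memory aspect. Storing the keys in a dictionary assumes that all keys fit in memory, which in general may not be the case. Sorting the reducers' input by key makes it possible to work with just one key at a time.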
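As a rough illustration of the memory point, a reducer that relies on sorted input does not need a dictionary at all. The sketch below is written for Python 3 (the scripts in the question are Python 2), and `read_words` and `reduce_sorted` are hypothetical names chosen for this example. It assumes the input lines are already sorted by word and, like the reducers in the question, counts lines rather than summing the count column.

```python
from itertools import groupby


def read_words(lines):
    """Yield the first field (the word) of each non-blank line."""
    for line in lines:
        fields = line.strip().split()
        if fields:
            yield fields[0]


def reduce_sorted(lines):
    """Yield (word, count) one key at a time; input must be sorted by word."""
    for word, group in groupby(read_words(lines)):
        # Only the current word and its running count are held in memory.
        yield word, sum(1 for _ in group)


# In-memory sample standing in for sorted mapper output.
sample = ["and\t1\n", "and\t1\n", "peace\t1\n", "war\t1\n"]
for word, count in reduce_sorted(sample):
    print('%s\t%s' % (word, count))
```

In a streaming job one would pass `sys.stdin` instead of `sample`. Each result is emitted as soon as its key ends, so memory use does not grow with the number of distinct words, whereas both reducers in the question keep every word in `word2count` until the end.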

Why does Hadoop sort the input to the reducer?
