I originally wrote:
import sys

previous_word = ""
word_ct = 0
for line in sys.stdin:
    word, count = line.split()
    if word == previous_word:
        # Same word as the previous line: keep accumulating
        word_ct += int(count)
    else:
        if previous_word != "":
            print(previous_word, word_ct)
        previous_word = word
        word_ct = int(count)
# Print the final word and count
print(previous_word, word_ct)
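I used that as a word counter. Now each word also has a spam/ham class, and I want to add the partial counts up into total counts, like this: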
#!/usr/bin/env python
"""
Reducer takes words with their class and partial counts and computes totals.
INPUT:
    word \t class \t partialCount
OUTPUT:
    word \t class \t totalCount
"""
import re
import sys

# initialize trackers
current_word = None
spam_count, ham_count = 0,0

# read from standard input
for line in sys.stdin:
    # parse input
    word, is_spam, count = line.split('\t')
    if current_word == word:
        #print(word, is_spam)
        if is_spam == 1:
            spam_count += int(count)
        else:
            ham_count += int(count)
    else:
        if current_word:
            if is_spam == 1:
                print("%s\t%s\t%s" % (current_word, is_spam, spam_count))
                spam_count = int(count)
            else:
                print("%s\t%s\t%s" % (current_word, is_spam, ham_count))
                ham_count = int(count)
        current_word = word
if current_word == word:
    if int(is_spam) == 1:
        print("%s\t%s\t%s" % (word, is_spam, spam_count))
    else:
        print("%s\t%s\t%s" % (word, is_spam, ham_count))
With this approach my code mostly works, but it does not handle the first record correctly. Running:
!echo -e "one\t1\t1\none\t0\t1\none\t0\t1\ntwo\t0\t1" | reducer.py
I only get:
one 0 2
two 0 1
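I can see that the first entry is being skipped, and it looks as if the if current_word: check is what changes the variable is_spam? The way this iteration plays out confuses me...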
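For reference, tracing the example by hand suggests two things are going on: line.split('\t') returns strings, so is_spam == 1 is never true, and the count on the very first line is never stored, because the else branch only assigns a count when current_word is already set. A minimal sketch of one possible fix, assuming the input is sorted so that identical (word, class) pairs are adjacent, tracks the pair as a single key. The function name reduce_counts and the sample rows are made up for illustration:

```python
def reduce_counts(lines):
    """Yield (word, is_spam, total) for each run of identical (word, class) keys.

    Assumption: equal (word, class) pairs arrive on adjacent lines,
    as they would after a sort on the first two fields.
    """
    current_key = None  # (word, is_spam) of the run being accumulated
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, is_spam, count = line.split("\t")
        key = (word, is_spam)  # is_spam stays a string: "0" or "1"
        if key == current_key:
            total += int(count)
        else:
            # Key changed: emit the finished run, then start a new one
            if current_key is not None:
                yield current_key[0], current_key[1], total
            current_key = key
            total = int(count)  # the first count of a run is kept
    # Emit the last run, if there was any input at all
    if current_key is not None:
        yield current_key[0], current_key[1], total


# Hypothetical sample rows matching the echo test
sample = ["one\t1\t1\n", "one\t0\t1\n", "one\t0\t1\n", "two\t0\t1\n"]
for word, is_spam, total in reduce_counts(sample):
    print("%s\t%s\t%s" % (word, is_spam, total))
```

With the sample rows this prints one line each for (one, 1) with total 1, (one, 0) with total 2, and (two, 0) with total 1. In an actual streaming reducer, the same loop would presumably iterate over reduce_counts(sys.stdin) after import sys.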
Source: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/