Reducer word count with 2 classes

Time: 2019-01-23 02:43:55

Tags: python hadoop streaming word-count

I originally wrote:

import sys

previous_word = ""
word_ct = 0

for line in sys.stdin:
    word, count = line.split()
    if word == previous_word:
        word_ct += int(count)
    else:
        if previous_word != "":
            print(previous_word, word_ct)
        previous_word = word
        word_ct = int(count)

# Print the final word and count
print(previous_word, word_ct)

I used this as a word counter. Now I have spam/ham classes and want to add the partial counts up into a total count per class, like this:

#!/usr/bin/env python
"""
Reducer takes words with their class and partial counts and computes totals.
INPUT:
    word \t class \t partialCount 
OUTPUT:
    word \t class \t totalCount  
"""
import re
import sys

# initialize trackers
current_word = None
spam_count, ham_count = 0,0

# read from standard input
for line in sys.stdin:
    # parse input
    word, is_spam, count = line.split('\t')

    if current_word == word:
        #print(word, is_spam)
        if is_spam == 1:
            spam_count += int(count)
        else:    
            ham_count += int(count)
    else:
        if current_word:
            if is_spam == 1:
                print("%s\t%s\t%s" % (current_word, is_spam, spam_count))
                spam_count = int(count)
            else:
                print("%s\t%s\t%s" % (current_word, is_spam, ham_count))
                ham_count = int(count)
        current_word = word

if current_word == word:
    if int(is_spam) == 1:
        print("%s\t%s\t%s" % (word, is_spam, spam_count))
    else:
        print("%s\t%s\t%s" % (word, is_spam, ham_count))

With this, my code mostly works, but it does not handle the first record correctly. Using:

!echo -e "one\t1\t1\none\t0\t1\none\t0\t1\ntwo\t0\t1" | reducer.py

I only get:

one 0   2
two 0   1

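I can see that the first entry is skipped, apparently because by the time `if current_word:` runs, the variable is_spam has already changed? This iteration is confusing me...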
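Tracing the reducer above suggests two separate causes. On the very first line, current_word is still None, so the else branch runs, the `if current_word:` guard is false, and that line's count is never added to either counter. In addition, `is_spam == 1` compares a string with an integer and is therefore always false, so every record inside the loop is treated as ham. The following is a minimal sketch of one way to restructure the loop, not a definitive fix: the names `reduce_counts` and `flush_word` are hypothetical, and emitting spam before ham while skipping zero totals is an assumption about the desired output.

```python
def flush_word(word, totals):
    """Yield one 'word\\tclass\\ttotal' line per class with a nonzero total."""
    for label, total in totals.items():
        if total:
            yield "%s\t%s\t%d" % (word, label, total)


def reduce_counts(lines):
    """Sum partial counts per (word, class) from input grouped by word.

    Each input line is 'word\\tclass\\tpartialCount'. Class labels are kept
    as strings ("1" = spam, "0" = ham) so no str/int comparison is needed.
    """
    current_word = None
    totals = {"1": 0, "0": 0}

    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, is_spam, count = line.split("\t")

        # When the word changes, emit the finished word first, using its
        # own accumulated totals rather than the new line's class.
        if current_word is not None and word != current_word:
            yield from flush_word(current_word, totals)
            totals = {"1": 0, "0": 0}

        # Every line is counted, including the first one for each word.
        current_word = word
        totals[is_spam] = totals.get(is_spam, 0) + int(count)

    # Emit the last word once the input is exhausted.
    if current_word is not None:
        yield from flush_word(current_word, totals)


# Same records as the echo test in the question, with tab-separated fields.
sample = "one\t1\t1\none\t0\t1\none\t0\t1\ntwo\t0\t1\n"
for out in reduce_counts(sample.splitlines()):
    print(out)
```

In an actual streaming job the demo at the bottom would presumably be replaced by iterating over `reduce_counts(sys.stdin)` (after `import sys`). Keeping both totals per word means the result does not depend on how the two classes are ordered within a word, only on the lines being grouped by word.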

Source: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

0 answers:

No answers