I need to compute several different statistics from some input data. Here is my mapper:
import sys

# Labels that tell the reducer which statistic a record belongs to.
label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record"

total_docs_counter = 0
no_lang_info = 0

for line in sys.stdin:
    line = line.strip()
    line2 = line.split('|')
    lang = line2[-1]
    total_docs_counter += 1
    if lang == '-':
        # No language information for this document.
        no_lang_info += 1
        continue
    doc_url = line2[0]
    lang = lang.split('&')
    max_lang = ""
    max_percent = 0
    for num, ln in enumerate(lang):
        if num == 0:
            continue
        tmp = ln.split('-')
        lang_id = tmp[0]
        # update_lang_statistics(total_lang_record, lang_id)
        print("%s\t%s\t1" % (label_1, lang_id))
    print("%s\t%s\t1" % (label_2, max_lang))
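The mapper initialises max_lang and max_percent, but the lines shown never update them, so the max_lan record would always carry an empty name. The sketch below shows one way that step might look. It assumes, and this is only an assumption since the post does not show the input format, that every item after the first in the '&'-separated field looks like "<lang_id>-<percent>". The function name dominant_language and the sample field are hypothetical.

```python
def dominant_language(lang_field):
    """Return (lang_id, percent) for the entry with the highest percentage.

    Assumed format (not confirmed by the post): "x&en-90&ur-10", where the
    first '&' item is skipped, as in the mapper, and every other item is
    "<lang_id>-<percent>".
    """
    max_lang, max_percent = "", 0.0
    for num, entry in enumerate(lang_field.split("&")):
        if num == 0:
            continue  # the mapper skips the first item as well
        parts = entry.split("-")
        if len(parts) < 2:
            continue  # no percentage present, ignore this entry
        try:
            percent = float(parts[1])
        except ValueError:
            continue  # percentage is not numeric, ignore this entry
        if percent > max_percent:
            max_lang, max_percent = parts[0], percent
    return max_lang, max_percent


# Hypothetical sample field, for illustration only.
print(dominant_language("x&en-90&ur-10"))
```

If the real field uses a different layout, only the parsing inside the loop would need to change.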
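I tag every output record with a label so that the reducer can tell which category it belongs to. A simple conditional in the reducer then gives me the output I need. Obviously the inner loop emits more records than the outer loop. I test it locally like this: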
cat input | ./mapper.py | sort | ./reducer.py
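That works fine, but when I run the job on Hadoop with some sample data, it looks as if no reduce step takes place: the output file contains the mapper output, and the mapper output is not reduced correctly. My mapper is very simple; it just sums up each category separately. Where is the problem? Is there a better way to do this?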
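One possible cause worth checking, which the post itself does not confirm: Hadoop Streaming by default treats the text up to the first tab as the key, so here the key would be the label alone. Records are then ordered by label, but the names within one label are not guaranteed to be adjacent the way they are after a plain `sort` of whole lines. The sketch below uses a hypothetical helper, run_length_sum, that sums consecutive equal records in the same spirit as the reducer, and runs it on both orderings.

```python
def run_length_sum(lines):
    """Sum counts of consecutive records that share the same (label, name)."""
    out = []
    prev, count = None, 0
    for line in lines:
        label, name, freq = line.split("\t")
        if (label, name) == prev:
            count += int(freq)
        else:
            if prev is not None:
                out.append("%s\t%s\t%d" % (prev[0], prev[1], count))
            prev, count = (label, name), int(freq)
    if prev is not None:  # flush the final group
        out.append("%s\t%s\t%d" % (prev[0], prev[1], count))
    return out


# Made-up records in the mapper's output format.
records = ["tot_lan\ten\t1", "tot_lan\tur\t1", "tot_lan\ten\t1"]

# Local pipeline: `sort` orders whole lines, so equal names are adjacent.
print(run_length_sum(sorted(records)))

# Grouped by label only: equal names may arrive interleaved.
print(run_length_sum(records))
```

With this sample, the fully sorted input collapses to two lines, while the interleaved input comes back as three lines that look just like the mapper output.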
Here is the reducer code:
import sys


def counting_reducer(line, temp, count, label):
    # Each record is "label<TAB>name<TAB>freq"; the label parsed from the
    # line replaces the one passed in.
    label, name, freq = line.split("\t")
    freq = int(freq)
    if name == temp:
        count += freq
    else:
        # A new name starts: emit the finished group, then reset.
        if temp:
            print('%s\t%s\t%s' % (label, temp, count))
        count = freq
        temp = name
    return [temp, count]


label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record"

# TMP variables
temp_l1 = None
count_l1 = 0
temp_l2 = None
count_l2 = 0
temp_l3 = None
count_l3 = 0
temp_l4 = None
count_l4 = 0
urdu_bin_record = {}
skip = 0

for line in sys.stdin:
    line = line.strip()
    if line.startswith(label_1):
        temp_l1, count_l1 = counting_reducer(line, temp_l1, count_l1, label_1)
    elif line.startswith(label_2):
        temp_l2, count_l2 = counting_reducer(line, temp_l2, count_l2, label_2)
    else:
        print(line)
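The reducer above depends on equal names arriving next to each other, and in the code as shown the last group of each label is only held in temp and count, never printed after the loop. As a rough alternative, the sketch below sums per (label, name) pair in a dictionary, so neither the input order nor a final flush matters. The function name aggregate and the sample records are hypothetical, and the approach assumes the number of distinct pairs fits in memory.

```python
from collections import defaultdict


def aggregate(lines):
    """Sum the counts for every (label, name) pair, regardless of input order."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip records that are not label<TAB>name<TAB>freq
        label, name, freq = parts
        totals[(label, name)] += int(freq)
    # Emit the totals in a stable, sorted order.
    return ["%s\t%s\t%d" % (label, name, count)
            for (label, name), count in sorted(totals.items())]


# Made-up records, deliberately not sorted by name.
sample = [
    "tot_lan\ten\t1",
    "tot_lan\tur\t1",
    "tot_lan\ten\t1",
    "max_lan\tur\t1",
]
for out in aggregate(sample):
    print(out)
```

In a streaming job the same function could be fed `sys.stdin` in place of the sample list. Another option would be to keep the streaming reducer and make both the label and the name part of the key, which Hadoop Streaming exposes through the `stream.num.map.output.key.fields` setting.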