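I am trying to integrate NLTK with Hadoop. Basically, I want to run pos_tag on words. I followed the approach described at this link: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
However, I still get an error when running the MapReduce program: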
14/12/09 11:45:53 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201412091132_0004_m_000000
14/12/09 11:45:53 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
My mapper program is:
#!/usr/bin/env python
import sys
import os
import re
import zipimport

importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')

# input comes from STDIN (standard input)
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        a = nltk.pos_tag(word)
        print '%s\t%s' % (word, 1)
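For comparison, here is a minimal sketch of how the tagging loop might be restructured so that it can be tested locally before it is submitted to Hadoop. It is an illustration under assumptions, not a confirmed fix for the failure above. The names `map_line`, `run`, and `stub_tagger` are hypothetical, and the tagger is passed in as a parameter so the sketch runs without NLTK installed. It tags the whole token list of a line in one call, since `nltk.pos_tag` expects a list of tokens and a bare string would be iterated character by character.

```python
import sys


def map_line(line, tagger):
    """Return 'word<TAB>tag' records for one line of input.

    `tagger` is any callable that takes a list of tokens and returns
    (token, tag) pairs, which is the shape nltk.pos_tag returns.
    """
    words = line.strip().split()
    if not words:
        return []
    # Tag the full token list in a single call rather than word by word.
    return ['%s\t%s' % (word, tag) for word, tag in tagger(words)]


def run(stream, tagger, write=sys.stdout.write):
    """Feed every line of `stream` through map_line and emit the records."""
    for line in stream:
        for record in map_line(line, tagger):
            write(record + '\n')


def stub_tagger(tokens):
    # Hypothetical stand-in for nltk.pos_tag: labels every token 'NN'.
    return [(token, 'NN') for token in tokens]
```

In the real mapper the call would presumably be `run(sys.stdin, nltk.pos_tag)`, with `nltk` loaded as in the program above. Running the script locally with `stub_tagger` first separates problems in the loop itself from problems in loading NLTK on the task nodes.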
I am using the same reducer program as in the word count example. I am new to Hadoop. Please help.
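For reference, a word-count style reducer for Hadoop streaming typically looks something like the sketch below. This is an assumption about the general shape of such a reducer, not the exact program used here, and the names `parse` and `reduce_counts` are hypothetical. It relies on the fact that Hadoop streaming delivers the reducer's input sorted by key, so equal keys arrive on consecutive lines.

```python
import sys
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>count' record into (key, int(count))."""
    key, _, value = line.rstrip('\n').partition('\t')
    return key, int(value)


def reduce_counts(lines):
    """Yield (key, total) for each run of consecutive lines sharing a key.

    Assumes the input is already sorted by key.
    """
    records = (parse(line) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda record: record[0]):
        yield key, sum(count for _, count in group)


def run(stream, write=sys.stdout.write):
    """Emit one 'key<TAB>total' line per distinct key in `stream`."""
    for key, total in reduce_counts(stream):
        write('%s\t%d\n' % (key, total))
```

A reducer of this shape expects every mapper output line to be a key, a tab, and an integer, which matches the `word<TAB>1` records printed by the mapper above.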