Error when integrating NLTK with Hadoop

Time: 2014-12-09 06:44:32

Tags: hadoop nltk

I am trying to integrate NLTK with Hadoop. Basically, I want to run pos_tag on the words in my input. I followed this post: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

However, when I run the MapReduce program I still get this error:

14/12/09 11:45:53 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201412091132_0004_m_000000
14/12/09 11:45:53 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

My mapper program is:

#!/usr/bin/env python

import sys
import os
import re
import zipimport

importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')

# input comes from STDIN (standard input)
for line in sys.stdin:
  line = line.strip()
  words = line.split()
  for word in words:
    a = nltk.pos_tag(word)
    print '%s\t%s' % (word, 1)

I am using the same reducer program as in the word count example. I am new to Hadoop. Please help.
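
A hedged sketch for narrowing this down, not a confirmed fix: the log above only reports that a map task failed, so the underlying Python error has to be read from the failed task attempt's stderr log. The version below assumes nltk is already importable on the task node (for example through the zipimport approach in the mapper above). The helper names load_tagger and map_lines are illustrative and are not part of NLTK or Hadoop. It tags each line's tokens in a single call, since nltk.pos_tag takes a list of tokens rather than a single string, and it writes any traceback to stderr.

```python
#!/usr/bin/env python
"""Streaming mapper sketch: tag whole lines and surface errors on stderr."""
import sys
import traceback


def load_tagger():
    # Assumption: nltk can be imported on the task node. The import is done
    # lazily so that an import failure is caught and logged by main().
    import nltk
    return nltk.pos_tag


def map_lines(lines, tagger):
    # pos_tag expects a list of tokens, so pass the whole tokenized line.
    # Passing one word (a string) would tag its individual characters.
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        for word, tag in tagger(tokens):
            yield '%s\t%s' % (word, tag)


def main():
    try:
        tagger = load_tagger()
        for record in map_lines(sys.stdin, tagger):
            print(record)
    except Exception:
        # Hadoop Streaming keeps the mapper's stderr in the task attempt
        # logs, so the real Python error becomes visible there.
        traceback.print_exc(file=sys.stderr)
        sys.exit(1)
```

To use this as the mapper, call main() at the end of the script. It can be checked locally before submitting the job by piping a sample file through it, which usually shows import or data errors faster than a cluster run. Note that pos_tag also loads tagger model data from nltk_data, so that data has to be available on the task nodes as well; whether that is the cause here is not known from the log shown.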

0 answers:

No answers