Unable to include a Python package (nltk) with a Hadoop streaming job

Time: 2018-04-09 05:22:45

Tags: python hadoop nltk hadoop-streaming

I am following several posts about including nltk with a Hadoop streaming job, for example post1 and post2.

My mapper code is as follows:

#!/usr/bin/env python

'''
trainingMapper1_test.py
'''


import sys
import zipimport


importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')
from nltk.tokenize import WordPunctTokenizer


def getToken(line):
    # Split one input line into word and punctuation tokens.
    tokenizer = WordPunctTokenizer()
    allTokens = tokenizer.tokenize(line)
    return allTokens

def read_input(file):
    for line in file:
        yield getToken(line)

def main(separator='\t'):
    data = read_input(sys.stdin)
    for tokenlist in data:
        for token in tokenlist:
            print ('%s%s%d' % (token, separator, 1))

if __name__ == "__main__":
    main()

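Basically, I just tokenize each line and then print the tokens one by one. The script works fine when I test it with the following command: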

python trainingMapper1_test.py < input.txt
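A local run like this uses the current directory and the local Python installation, which may not match what a task sees on a worker. As a rough sketch of a closer check, the mapper can be run from an empty temporary directory that holds only the files passed with -file. The helper name run_in_clean_dir is hypothetical, and the sketch assumes that shipped files end up in the task's working directory; it still uses the local interpreter and its site-packages, so it only approximates the cluster.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_in_clean_dir(script, shipped_files, input_text):
    """Run `script` from an empty temp dir containing only the shipped files.

    Hypothetical helper: approximates a streaming task's working directory.
    Returns a tuple (return_code, stdout, stderr).
    """
    workdir = tempfile.mkdtemp(prefix="streaming_sim_")
    try:
        # Copy the mapper and every shipped file into the empty directory.
        for path in [script] + list(shipped_files):
            shutil.copy(path, workdir)
        proc = subprocess.Popen(
            [sys.executable, os.path.basename(script)],
            cwd=workdir,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        out, err = proc.communicate(input_text.encode("utf-8"))
        return proc.returncode, out.decode("utf-8"), err.decode("utf-8")
    finally:
        # Always remove the temporary directory, even if the run failed.
        shutil.rmtree(workdir, ignore_errors=True)
```

For instance, calling run_in_clean_dir('trainingMapper1_test.py', ['nltk.mod'], 'hello world\n') and printing the returned stderr would show any Python traceback that the mapper produces in that isolated directory.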

However, when I run it on Hadoop with:

hadoop  jar $JARFILE  \
-D mapreduce.job.reduces=0 \
-input input_dir \
-output justatest \
-file nltk.mod \
-file trainingMapper1_test.py \
-mapper trainingMapper1_test.py

where input_dir contains only one file, input.txt, I get the following exception:

  

18/04/09 01:03:31 INFO mapreduce.Job: Task Id : attempt_1523248713710_0006_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
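The "subprocess failed with code 1" message only says that the Python process exited with a nonzero status; any Python traceback goes to the task's stderr log rather than to this Java stack trace. A minimal sketch of making that log more informative is to wrap the import and print the working directory contents before re-raising. Note that it puts the archive on sys.path instead of calling zipimporter.load_module, which is a different technique from the one in the mapper, and the function name import_from_archive is hypothetical.

```python
import importlib
import os
import sys
import traceback


def import_from_archive(archive, module_name):
    """Import `module_name` from a zip archive, logging context on failure."""
    try:
        if not os.path.isfile(archive):
            raise IOError("archive not found: %s" % archive)
        # A zip file on sys.path is handled by the regular import system,
        # which uses zipimport internally; the file extension does not matter.
        if archive not in sys.path:
            sys.path.insert(0, archive)
        return importlib.import_module(module_name)
    except Exception:
        # Everything written to stderr should appear in the task's stderr log.
        sys.stderr.write("cwd: %s\n" % os.getcwd())
        sys.stderr.write("files: %s\n" % sorted(os.listdir(".")))
        traceback.print_exc(file=sys.stderr)
        raise
```

In the mapper this would be used as nltk = import_from_archive('nltk.mod', 'nltk'), placed before the from nltk.tokenize import line.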

From my testing, the line that causes the failure is this one:

importer = zipimport.zipimporter('nltk.mod')

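But I don't know how to fix it. It seems that either nltk.mod is not being shipped to the workers, or the workers cannot load it. Can anyone help me solve this problem?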
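One way to tell these two possibilities apart is to check, from inside the mapper, whether the archive is present in the working directory, whether it is a readable zip, and whether the package directory sits at the archive root. The following is only a sketch: it assumes nltk.mod is a plain zip of the nltk source directory (how it was built is not stated here), and check_archive is a hypothetical name.

```python
import os
import zipfile


def check_archive(archive, package):
    """Return a list of human-readable findings about a zipped package."""
    findings = []
    if not os.path.isfile(archive):
        findings.append("missing: %s not found in %s" % (archive, os.getcwd()))
        return findings
    if not zipfile.is_zipfile(archive):
        findings.append("invalid: %s is not a zip file" % archive)
        return findings
    with zipfile.ZipFile(archive) as zf:
        names = set(zf.namelist())
    # A package is importable from a zip only if its __init__ file
    # sits directly under "<package>/" at the archive root.
    init_py = "%s/__init__.py" % package
    init_pyc = "%s/__init__.pyc" % package
    if init_py in names or init_pyc in names:
        findings.append("ok: %s is at the archive root" % package)
    else:
        findings.append("layout: %s/__init__.py not at the archive root" % package)
    return findings
```

Writing the result of check_archive('nltk.mod', 'nltk') to sys.stderr at the top of the mapper would make the findings visible in the task logs. Even with a correct layout, the import can still fail if the package imports other modules that are not available on the worker.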

0 answers:

No answers