Question

I am trying to run a Python program on Hadoop. The program uses the NLTK library. It also uses the Hadoop Streaming API, as described here.

mapper.py:

#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords

#print stopwords.words('english')

for line in sys.stdin:
        print line,

reducer.py:

#!/usr/bin/env python

import sys
for line in sys.stdin:
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
  -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
  -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
  -input /hadoop/input.txt -output /hadoop/output

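This runs perfectly, and the output contains just the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, the program fails and says

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked that, in a standalone Python program,

print stopwords.words('english')

works perfectly fine, so I am absolutely stumped as to why it causes my Hadoop program to fail.

I would greatly appreciate any help! Thank you.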

Answer 1

Ship nltk and yaml to the nodes as zip archives and load them in the mapper with zipimport:

import zipimport

# nltk.zip and yaml.zip must be in the task's working directory
importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')

Check the links I pasted. They mention all the steps.
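To illustrate the mechanism this answer relies on, here is a minimal, self-contained sketch of importing a package straight from a zip archive. It is not the answer's exact method: it puts the archive on sys.path instead of calling load_module (which newer Python versions deprecate), it is written for Python 3, and the package name demo_pkg and its contents are hypothetical stand-ins for nltk.zip.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create demo_pkg.zip holding a tiny hypothetical package."""
    zip_path = os.path.join(directory, "demo_pkg.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("demo_pkg/__init__.py", "GREETING = 'hello from zip'\n")
    return zip_path


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path, then import the named module from it."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    return importlib.import_module(module_name)


# Build the archive in a temporary directory and import from it.
workdir = tempfile.mkdtemp()
demo = import_from_zip(build_demo_zip(workdir), "demo_pkg")
print(demo.GREETING)
```

In a streaming job the archive would be shipped with -file rather than built on the fly. Note that this only makes the library code importable; any corpus data the library reads at run time still has to be present on the node.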

Answer 2

Is 'english' in print stopwords.words('english') a file? If so, you also need to send it to the nodes using -file.
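As a rough sketch of that idea, the mapper could read the stopword list from a plain text file that travels with the job instead of going through the NLTK corpus loader. This is Python 3 and purely illustrative: the helper names are made up, and it assumes a one-word-per-line stopword file. The demo at the bottom uses in-memory data so it runs anywhere.

```python
import io


def load_stopwords(handle):
    """Read one stopword per line, ignoring blank lines."""
    return {line.strip().lower() for line in handle if line.strip()}


def filter_line(line, stopwords):
    """Return the line with stopwords removed (case-insensitive match)."""
    kept = [word for word in line.split() if word.lower() not in stopwords]
    return " ".join(kept)


def run_mapper(lines, stopwords, out):
    """Write each input line to out with its stopwords stripped."""
    for line in lines:
        out.write(filter_line(line, stopwords) + "\n")


# Demo with in-memory stand-ins for the shipped file and stdin/stdout.
demo_stopwords = load_stopwords(io.StringIO("the\nis\na\n"))
demo_out = io.StringIO()
run_mapper(["The cat is a mammal\n"], demo_stopwords, demo_out)
print(demo_out.getvalue(), end="")
```

In an actual job, the stopword file would be added to the command with another -file option, opened by its bare file name from the task's working directory, and run_mapper would be called with sys.stdin and sys.stdout.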

Hadoop and NLTK: Fails with stopwords
