I am trying to run the mrjob example from an IPython notebook:
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
and then run it with this code:
mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value
and I get the error:
TypeError: <module '__main__' (built-in)> is a built-in class
Is there a way to run mrjob from an IPython notebook?
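Answer 0: (score: 1)
I suspect this is due to this limitation described on the MRJob website:
The file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs! The code that runs the job should only run outside of the Hadoop context.
Alternatively, it may be because your job file is missing the following (reference):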
if __name__ == '__main__':
    MRWordCounter.run()  # where MRWordCounter is your job class
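To illustrate what that guard does, here is a minimal standard-library sketch, assuming Python 3. The file name guard_demo.py and its run() function are hypothetical stand-ins for a real job file and for MRWordCounter.run(); nothing in it calls mrjob. It only shows that the guarded call fires when the file is executed as a script and stays silent when the file is imported.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for a job file. A real one would define an MRJob
# subclass and call its run() classmethod inside the guard instead.
SCRIPT = textwrap.dedent('''
    def run():
        print("job launched")

    if __name__ == "__main__":
        run()
''')


def run_guard_demo():
    """Return (stdout when run as a script, stdout when only imported)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "guard_demo.py")
        with open(path, "w") as f:
            f.write(SCRIPT)

        # Executed directly: __name__ is "__main__", so run() is called.
        as_script = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, check=True,
        ).stdout

        # Imported: __name__ is "guard_demo", so the guard body is skipped.
        import_code = (
            "import sys; sys.path.insert(0, %r); import guard_demo" % tmp
        )
        as_import = subprocess.run(
            [sys.executable, "-c", import_code],
            capture_output=True, text=True, check=True,
        ).stdout
    return as_script, as_import


if __name__ == "__main__":
    print(run_guard_demo())
```

The same separation is what lets a job file be shipped and imported elsewhere without launching a new job as a side effect.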
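Answer 1: (score: 1)
I haven't found the "perfect way" yet, but one thing you can do is create a notebook cell that uses the %%file magic to write the cell's contents to a file: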
%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
Then have mrjob run that file in a later cell:
import wordcount
reload(wordcount)  # Python 2 built-in; on Python 3 use importlib.reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value
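Note that I named the file wordcount.py and imported the MRWordFrequencyCount class from the wordcount module; the file name and the module name must match. Python also caches imported modules, so when you change wordcount.py, IPython does not re-read the file and keeps using the old cached module. That is why I put the reload() call there.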
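The caching behaviour can be reproduced without mrjob. The following is a minimal sketch, assuming Python 3 (where reload lives in importlib); the module name cache_demo and its VALUE attribute are made up for illustration. It rewrites a module on disk, shows that a second import still returns the cached version, and shows that an explicit reload picks up the change.

```python
import importlib
import os
import sys
import tempfile


def write_module(path, value):
    """Write a one-line module that defines VALUE."""
    with open(path, "w") as f:
        f.write("VALUE = %r\n" % value)


def run_reload_demo():
    """Return VALUE after first import, after re-import, and after reload."""
    old_flag = sys.dont_write_bytecode
    # Skip .pyc files so a fast rewrite cannot be masked by stale bytecode.
    sys.dont_write_bytecode = True
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cache_demo.py")  # hypothetical module
        sys.path.insert(0, tmp)
        try:
            write_module(path, "old")
            importlib.invalidate_caches()
            mod = importlib.import_module("cache_demo")
            first = mod.VALUE

            # Change the file on disk, as if the %%file cell had been re-run.
            write_module(path, "updated")

            # A plain import is served from sys.modules; the file is not re-read.
            mod = importlib.import_module("cache_demo")
            cached = mod.VALUE

            # reload() re-executes the module source in place.
            importlib.reload(mod)
            reloaded = mod.VALUE
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("cache_demo", None)
            sys.dont_write_bytecode = old_flag
    return first, cached, reloaded


if __name__ == "__main__":
    print(run_reload_demo())
```

In a notebook the same thing happens with wordcount: re-running the cell that writes the file changes the file only, so the reload is what makes the later cell see the new class.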
Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ
Update (shorter)
For a shorter second notebook cell, you can run mrjob by calling the shell from the notebook:
! python mrjob.py shakespeare.txt
Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb