Running MapReduce from a Jupyter Notebook

Time: 2017-02-19 23:18:11

Tags: python jupyter-notebook mrjob

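I am trying to run MapReduce in a Jupyter Notebook on the dataset in the u.data file, but I keep getting an error message that says:

"TypeError: 'str' object does not support item deletion".

How can I get the code to run successfully?

u.data contains the following information: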

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
6   86  3   883603013

Here is the code:

from mrjob.job import MRJob

class MRRatingCounter(MRJob):
    def mapper(self, key, line):
        (userID, movieID, rating, timestamp) = line.split("\t")
        yield rating, 1

    def reducer(self, rating, occurences):
        yield rating, sum(occurences)

if __name__ == "main__":
    MRRatingCounter.run()

filepath = "u.data"

MRRatingCounter(filepath)

The code runs successfully if it is saved in a .py file and run from the command line: !python ratingCounter.py u.data

2 answers:

Answer 0: (score: 0)

MRRatingCounter needs to live in its own .py file, let's say MRRatingCounter.py:

from mrjob.job import MRJob

class MRRatingCounter(MRJob):

    def mapper(self, key, line):
        (userID, movieID, rating, timestamp) = line.split("\t")
        yield rating, 1

    def reducer(self, rating, occurrences):
        yield rating, sum(occurrences)

if __name__ == "__main__":
    MRRatingCounter.run()

Import the class into your notebook and execute it through a runner:

from MRRatingCounter import MRRatingCounter

mr_job = MRRatingCounter(args=['u.data'])
with mr_job.make_runner() as runner:
    runner.run()
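    # mrjob 0.6+ replaces the older runner.stream_output() with
    # parse_output(cat_output()), which yields decoded key/value pairs
    for key, value in mr_job.parse_output(runner.cat_output()):
        # handle each key/value pair however you like
        print(key, value)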
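
As for why the original call fails: a plausible explanation, assuming the mrjob version in the question passed the constructor argument to an optparse-style parser that consumes arguments by deleting them one at a time, is that a bare string was passed where a list was expected. The helper below is hypothetical and only mimics that behaviour in plain Python; it is not mrjob's actual code.

```python
def try_delete_first(args):
    """Mimic a parser that copies its args and deletes them as it consumes them."""
    rargs = args[:]  # slicing a str gives a str, slicing a list gives a list
    try:
        del rargs[0]  # works on a list, raises TypeError on a str
    except TypeError as exc:
        return "TypeError: " + str(exc)
    return rargs


# Passing the bare path, as in MRRatingCounter(filepath)
string_result = try_delete_first("u.data")

# Passing a list, as in MRRatingCounter(args=['u.data'])
list_result = try_delete_first(["u.data"])

print(string_result)
print(list_result)
```

This is why the runner example passes args=['u.data'] rather than the path string itself.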

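Answer 1: (score: 0)

As you mentioned, the important part is saving the code as a .py file. To do that from the notebook, put %%file filename.py on the first line of the cell.

In this case I used rc.py as the file name, and all of the code goes into a single cell: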

%%file rc.py
from mrjob.job import MRJob
class MRRatingCounter(MRJob):
    def mapper(self, key, line):
        (userId, movieId, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer(self, rating, occurrences):
        yield rating, sum(occurrences)

if __name__ == '__main__':
    MRRatingCounter.run()

After running that cell, you can run the following command in the next cell:

!python rc.py u.data

This will give you the output you are looking for.
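
If you want the counts back in Python rather than just printed, here is a minimal sketch for parsing the job's output. It assumes mrjob's default output protocol, where each line is a JSON-encoded key and a JSON-encoded value separated by a tab. The function names and the sample lines are made up for illustration; they are not real output from this job.

```python
import json


def parse_output_line(line):
    """Split one 'key<TAB>value' output line and JSON-decode both halves."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


def parse_output(lines):
    """Collect the decoded key/value pairs into a dict, skipping blank lines."""
    return dict(parse_output_line(line) for line in lines if line.strip())


# Illustrative lines in the assumed format (not actual job output)
sample_output = ['"1"\t2\n', '"2"\t2\n', '"3"\t4\n']
rating_counts = parse_output(sample_output)
print(rating_counts)
```

In the notebook, something like `lines = !python rc.py u.data` captures the command's output as a list of strings that can be passed to parse_output.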