Question

I was following this popular blog post:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

There, the author first demonstrates a simple, plain Python mapper:

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

Then he mentions:

However, in a real-world application, you might want to optimize your code by using Python iterators and generators (an even better introduction in PDF).

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

My question: why is the second mapper more efficient than the first one?

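If I understand yield correctly, it pauses the function until the next call. So basically, each time data is asked for its next item during iteration, read_input resumes and yields another item.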
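
To make that understanding concrete, here is a minimal sketch of what I think happens. It assumes an in-memory io.StringIO object as a stand-in for sys.stdin, and the events list and fake_stdin name are hypothetical additions of mine for tracing, not part of the blog's code:

```python
import io

def read_input(file, events):
    for line in file:
        events.append('read: ' + line.strip())
        # Execution pauses at this yield until the consumer
        # asks for the next item.
        yield line.split()

events = []
# Stand-in for sys.stdin (an assumption made for this sketch).
fake_stdin = io.StringIO(u"a b\nc d\n")

# Calling a generator function only creates the generator object;
# no line has been read at this point.
data = read_input(fake_stdin, events)
events_before_iteration = list(events)

for words in data:
    events.append('consumed: ' + ' '.join(words))
```

If my reading is right, events_before_iteration stays empty and the 'read' and 'consumed' entries alternate, so each line is read only when the loop asks for it.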

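But aren't we doing the same thing even in the first, simple mapper? for line in sys.stdin: basically reads the stdin available to the host the mapper is running on, and we operate on it line by line.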
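
As a rough check of that claim, the sketch below runs the logic of both mappers over the same input. It is an adaptation, not the blog's code: the function names simple_mapper and generator_mapper are mine, the output lines are collected in a list instead of printed, and io.StringIO stands in for sys.stdin. It compares output only and does not measure speed or memory:

```python
import io

def simple_mapper(stream):
    # Logic of the first mapper: iterate over the stream directly.
    out = []
    for line in stream:
        line = line.strip()
        for word in line.split():
            out.append('%s\t%s' % (word, 1))
    return out

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def generator_mapper(stream, separator='\t'):
    # Logic of the second mapper: iterate through the generator.
    out = []
    for words in read_input(stream):
        for word in words:
            out.append('%s%s%d' % (word, separator, 1))
    return out

text = u"foo bar\nfoo\n"  # sample input, made up for this sketch
simple_out = simple_mapper(io.StringIO(text))
generator_out = generator_mapper(io.StringIO(text))
```

Both versions walk the input one line at a time and, as far as I can tell, emit exactly the same key/value lines.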

What is the advantage of using yield? What do I gain over the first version? Speed? Memory?

Thanks a lot.

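Edit: I'm not sure why people think this is a duplicate. I'm not asking how the 'yield' keyword works; I'm asking for an explanation of what benefit it provides in the context of a Hadoop streaming mapper.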

Understanding a Hadoop mapper in Python

0 answers: