How to make a generator work in Spark mapPartitions()?

Date: 2019-02-15 23:09:19

Tags: python apache-spark pyspark bigdata

I am trying to process a large text corpus in Spark using mapPartitions. Suppose we have some semi-processed data that looks like this:

    text_1 = [['A', 'B', 'C', 'D', 'E'],
    ['F', 'E', 'G', 'A', 'B'],
    ['D', 'E', 'H', 'A', 'B'],
    ['A', 'B', 'C', 'F', 'E'],
    ['A', 'B', 'C', 'J', 'E'],
    ['E', 'H', 'A', 'B', 'C'],
    ['E', 'G', 'A', 'B', 'C'],
    ['C', 'F', 'E', 'G', 'A'],
    ['C', 'D', 'E', 'H', 'A'],
    ['C', 'J', 'E', 'H', 'A'],
    ['H', 'A', 'B', 'C', 'F'],
    ['H', 'A', 'B', 'C', 'J'],
    ['B', 'C', 'F', 'E', 'G'],
    ['B', 'C', 'D', 'E', 'H'],
    ['B', 'C', 'F', 'E', 'K'],
    ['B', 'C', 'J', 'E', 'H'],
    ['G', 'A', 'B', 'C', 'F'],
    ['J', 'E', 'H', 'A', 'B']]

Each letter is a word. I also have a vocabulary:

    V = ['D','F','G','C','J','K']
    text_1RDD = sc.parallelize(text_1)

and I want to run the following in Spark:

    filtered_lists = text_1RDD.mapPartitions(partitions)

    filtered_lists.collect()

I have this function:

    def partitions(list_of_lists, vc):
        for w in vc:
            iterator = []
            for sub_list in list_of_lists:
                if w in sub_list:
                    iterator.append(sub_list)
            yield (w, len(iterator))

If I run it like this:

    c = partitions(text_1,V)
    for item in c:
        print(item)

it returns the correct counts:

    ('D', 4)
    ('F', 7)
    ('G', 5)
    ('C', 15)
    ('J', 5)
    ('K', 1)

However, I don't know how to run it in Spark:

    filtered_lists = text_1RDD.mapPartitions(partitions)

    filtered_lists.collect()

mapPartitions takes a function of only one argument, and running this in Spark produces lots of errors...

But even if I hard-code the vocabulary inside the partition function:

    def partitionsV(list_of_lists):
        vc = ['D', 'F', 'G', 'C', 'J', 'K']
        for w in vc:
            iterator = []
            for sub_list in list_of_lists:
                if w in sub_list:
                    iterator.append(sub_list)
            yield (w, len(iterator))

...I get this:

    filtered_lists = text_1RDD.mapPartitions(partitionsV)

    filtered_lists.collect()

Output:

     [('D', 2),
     ('F', 0),
     ('G', 0),
     ('C', 0),
     ('J', 0),
     ('K', 0),
     ('D', 0),
     ('F', 0),
     ('G', 0),
     ('C', 0),
     ('J', 0),
     ('K', 0),
     ('D', 1),
     ('F', 0),
     ('G', 0),
     ('C', 0),
     ('J', 0),
     ('K', 0),
     ('D', 1),
     ('F', 0),
     ('G', 0),
     ('C', 0),
     ('J', 0),
     ('K', 0)]

Clearly, the generator is not working as expected. I am completely stuck. I am very new to this, so I would really appreciate it if someone could explain what is happening here.
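What is happening here, sketched without Spark (an editor's reproduction, not from the original thread): `mapPartitions` passes each partition to the function as a one-shot iterator, not a list. The inner `for sub_list in list_of_lists` loop drains that iterator while counting the first vocabulary word, so every later word iterates over nothing and gets a count of 0. Materializing the iterator with `list()` once restores the counts:

```python
V = ['D', 'F', 'G', 'C', 'J', 'K']

# One simulated Spark partition (the first three lines of text_1).
partition = [['A', 'B', 'C', 'D', 'E'],
             ['F', 'E', 'G', 'A', 'B'],
             ['D', 'E', 'H', 'A', 'B']]

def partitionsV(list_of_lists):
    for w in V:
        # This inner pass exhausts a one-shot iterator on the first word,
        # so every subsequent word sees an empty sequence.
        yield (w, sum(1 for sub_list in list_of_lists if w in sub_list))

def partitionsV_fixed(list_of_lists):
    data = list(list_of_lists)   # materialize the iterator exactly once
    for w in V:
        yield (w, sum(1 for sub_list in data if w in sub_list))

# Spark hands mapPartitions a one-shot iterator; iter() simulates that here.
broken = list(partitionsV(iter(partition)))
fixed = list(partitionsV_fixed(iter(partition)))
print(broken)  # only 'D', the first word, gets a real count
print(fixed)   # all words counted correctly
```

This also matches the per-partition pattern in the question's output: within each partition, only the first vocabulary word ('D') ever gets a nonzero count.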

1 Answer:

Answer 0 (score: 0)

This is just another word count problem, and `mapPartitions` is not the tool for the job. The same result can be obtained with:

    from operator import add

    v = set(['D', 'F', 'G', 'C', 'J', 'K'])

    # Keep only vocabulary words from each line, pair each with 1,
    # and sum the counts per word.
    result = text_1RDD.flatMap(v.intersection).map(lambda x: (x, 1)).reduceByKey(add)
    for x in result.sortByKey().collect():
        print(x)
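As a sanity check (an editor's addition, not part of the original answer), the same flatMap-style pipeline can be reproduced in plain Python without a cluster, and it matches the counts the asker computed locally:

```python
from collections import Counter

# The 18 five-word lines from the question, written compactly.
text_1 = [list(s) for s in [
    'ABCDE', 'FEGAB', 'DEHAB', 'ABCFE', 'ABCJE', 'EHABC',
    'EGABC', 'CFEGA', 'CDEHA', 'CJEHA', 'HABCF', 'HABCJ',
    'BCFEG', 'BCDEH', 'BCFEK', 'BCJEH', 'GABCF', 'JEHAB',
]]
v = set(['D', 'F', 'G', 'C', 'J', 'K'])

# v.intersection(line) keeps each vocabulary word at most once per line,
# so the totals mean "number of lines containing the word".
counts = Counter(w for line in text_1 for w in v.intersection(line))
for w in sorted(counts):
    print((w, counts[w]))
```

This prints ('C', 15), ('D', 4), ('F', 7), ('G', 5), ('J', 5), ('K', 1), in agreement with the correct counts shown in the question.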