I am trying to use mapPartitions in Spark to process a large text corpus. Suppose we have some semi-processed data like this:
text_1 = [['A', 'B', 'C', 'D', 'E'],
['F', 'E', 'G', 'A', 'B'],
['D', 'E', 'H', 'A', 'B'],
['A', 'B', 'C', 'F', 'E'],
['A', 'B', 'C', 'J', 'E'],
['E', 'H', 'A', 'B', 'C'],
['E', 'G', 'A', 'B', 'C'],
['C', 'F', 'E', 'G', 'A'],
['C', 'D', 'E', 'H', 'A'],
['C', 'J', 'E', 'H', 'A'],
['H', 'A', 'B', 'C', 'F'],
['H', 'A', 'B', 'C', 'J'],
['B', 'C', 'F', 'E', 'G'],
['B', 'C', 'D', 'E', 'H'],
['B', 'C', 'F', 'E', 'K'],
['B', 'C', 'J', 'E', 'H'],
['G', 'A', 'B', 'C', 'F'],
['J', 'E', 'H', 'A', 'B']]
Each letter stands for a word. I also have a vocabulary:
V = ['D','F','G','C','J','K']
text_1RDD = sc.parallelize(text_1)
and I want to run the following in Spark:
filtered_lists = text_1RDD.mapPartitions(partitions)
filtered_lists.collect()
I have this function:
def partitions(list_of_lists, vc):
    # For each vocabulary word, count the sub-lists that contain it
    for w in vc:
        iterator = []
        for sub_list in list_of_lists:
            if w in sub_list:
                iterator.append(sub_list)
        yield (w, len(iterator))
If I run it like this:
c = partitions(text_1,V)
for item in c:
    print(item)
it returns the correct counts:
('D', 4)
('F', 7)
('G', 5)
('C', 15)
('J', 5)
('K', 1)
However, I don't know how to run it in Spark:
filtered_lists = text_1RDD.mapPartitions(partitions)
filtered_lists.collect()
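mapPartitions passes the function only a single argument, and running this in Spark produces a lot of errors...
However, even if I hard-code the vocabulary inside the partition function:
def partitionsV(list_of_lists):
    vc = ['D','F','G','C','J','K']
    # For each vocabulary word, count the sub-lists that contain it
    for w in vc:
        iterator = []
        for sub_list in list_of_lists:
            if w in sub_list:
                iterator.append(sub_list)
        yield (w, len(iterator))
...I get this: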
filtered_lists = text_1RDD.mapPartitions(partitionsV)
filtered_lists.collect()
Output:
[('D', 2),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 0),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 1),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 1),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0)]
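Clearly the generator is not working as expected. I am completely stuck, and I am very new to this. I would be very grateful if someone could explain what is happening here.
Answer 0 (score: 0)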
This is another word count problem, and mapPartitions is not the right tool for the job. The result can be computed with flatMap and reduceByKey:
from operator import add
v = set(['D','F','G','C','J','K'])
result = text_1RDD.flatMap(v.intersection).map(lambda x: (x, 1)).reduceByKey(add)
for x in result.sortByKey().collect():
    print(x)
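As for why partitionsV returns mostly zeros: mapPartitions hands the function a one-shot iterator for each partition, not a list. The inner loop consumes that iterator while checking the first word, so every later word sees an empty iterator. The output also contains one block of tuples per partition rather than global counts. The sketch below reproduces this in plain Python (no Spark needed) and shows a single-pass alternative. The names count_in_partition and partition are made up for illustration, and the Spark call in the final comment is an untested assumption about how it would be wired up.

```python
from collections import Counter
from functools import partial

V = ['D', 'F', 'G', 'C', 'J', 'K']

def partitionsV(list_of_lists, vc=V):
    # Same logic as in the question: loops over the input once per word.
    for w in vc:
        matches = []
        for sub_list in list_of_lists:
            if w in sub_list:
                matches.append(sub_list)
        yield (w, len(matches))

def count_in_partition(rows, vocab):
    # Hypothetical replacement: reads the iterator exactly once and
    # counts, for each vocabulary word, the rows that contain it.
    vocab = set(vocab)
    counts = Counter()
    for row in rows:
        counts.update(vocab.intersection(row))
    return iter(counts.items())

# Illustrative stand-in for one partition (the first three rows of text_1).
partition = [['A', 'B', 'C', 'D', 'E'],
             ['F', 'E', 'G', 'A', 'B'],
             ['D', 'E', 'H', 'A', 'B']]

# iter(...) mimics the one-shot iterator that mapPartitions supplies:
# it is exhausted after the first vocabulary word.
broken = dict(partitionsV(iter(partition)))

# functools.partial binds the vocabulary, leaving a one-argument function.
bound = partial(count_in_partition, vocab=V)
fixed = dict(bound(iter(partition)))

print(broken)
print(fixed)

# Assumed Spark usage (not run here): per-partition counts still have
# to be merged across partitions, for example
#   text_1RDD.mapPartitions(bound).reduceByKey(add)
```

Even with this fix, the per-partition counts need a final merge step, which is why the flatMap and reduceByKey version above is the simpler choice.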