Here is my sample data in MongoDB:
> db.stackin.find({})
{ "_id" : ObjectId("575ce909aa02c3b21f1be0bb"), "summary" : "good good day", "url" : "url_1" }
{ "_id" : ObjectId("575ce909aa02c3b21f1be0bc"), "summary" : "hello world good world", "url" : "url_2" }
{ "_id" : ObjectId("575ce909aa02c3b21f1be0bd"), "summary" : "hello world good hello good", "url" : "url_3" }
{ "_id" : ObjectId("575ce909aa02c3b21f1be0be"), "summary" : "hello world hello", "url" : "url_4" }
What I want is the count of every word, broken down by URL.
So for the data above, the expected result is:
{"good": [{"url_1": 2}, {"url_2": 1}, {"url_3": 2}]}
{"day": [{"url_1": 1}]}
{"hello": [{"url_2": 1}, {"url_3": 2}, {"url_4": 2}]}
{"world": [{"url_2": 2}, {"url_3": 1}, {"url_4": 1}]}
My code:
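import pyspark
import re
import collections
import pymongo_spark

pymongo_spark.activate()
# sc is assumed to be an existing SparkContext (e.g. the one the pyspark shell provides)
rdd = sc.mongoPairRDD("mongodb://localhost/testmr.stackin")

def f(record):
    """Count the words in one record's summary and print (word, {url: count}) pairs."""
    raw_summary = record[1]['summary']
    # strip ASCII and full-width punctuation before splitting on whitespace
    summary = re.sub("[\.\!\/,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+".decode("utf8"),
                     "".decode("utf8"), raw_summary)
    url = record[1]['url']
    _temp = dict(collections.Counter(summary.split()))
    result = [(key, {url: value}) for key, value in _temp.items()]
    print result

rdd.foreach(f)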
The contents of the RDD:
>>> rdd.collect()
[(ObjectId('575ce909aa02c3b21f1be0bb'),
{u'_id': ObjectId('575ce909aa02c3b21f1be0bb'),
u'summary': u'good good day',
u'url': u'url_1'}),
(ObjectId('575ce909aa02c3b21f1be0bc'),
{u'_id': ObjectId('575ce909aa02c3b21f1be0bc'),
u'summary': u'hello world good world',
u'url': u'url_2'}),
(ObjectId('575ce909aa02c3b21f1be0bd'),
{u'_id': ObjectId('575ce909aa02c3b21f1be0bd'),
u'summary': u'hello world good hello good',
u'url': u'url_3'}),
(ObjectId('575ce909aa02c3b21f1be0be'),
{u'_id': ObjectId('575ce909aa02c3b21f1be0be'),
u'summary': u'hello world hello',
u'url': u'url_4'})]
Running the code prints:
[(u'good', {u'url_1': 2}), (u'day', {u'url_1': 1})]
[(u'world', {u'url_2': 2}), (u'good', {u'url_2': 1}), (u'hello', {u'url_2': 1})]
[(u'world', {u'url_3': 1}), (u'good', {u'url_3': 2}), (u'hello', {u'url_3': 2})]
[(u'world', {u'url_4': 1}), (u'hello', {u'url_4': 2})]
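I think the problem is that the function f in my code does not return anything that ends up in an RDD; it only prints the per-record result.
How should I combine these per-record results into the merged output shown above?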
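
One possible direction, as a minimal sketch rather than a definitive answer: have the per-record function return its list of (word, {url: count}) pairs instead of printing it, then group the pairs by word. The block below emulates that grouping in plain Python so it runs without Spark or MongoDB. The names `records`, `word_counts`, `merge_by_word` and the simplified ASCII-only `PUNCTUATION` pattern are assumptions made for illustration, and the string keys "id_1" to "id_4" stand in for the ObjectId keys.

```python
import re
from collections import Counter, defaultdict

# Stand-in for the (key, document) pairs that rdd.collect() returns;
# the "id_N" keys are placeholders for the real ObjectId values.
records = [
    ("id_1", {"summary": "good good day", "url": "url_1"}),
    ("id_2", {"summary": "hello world good world", "url": "url_2"}),
    ("id_3", {"summary": "hello world good hello good", "url": "url_3"}),
    ("id_4", {"summary": "hello world hello", "url": "url_4"}),
]

# Simplified, ASCII-only version of the punctuation pattern (an assumption).
PUNCTUATION = re.compile("[.!/,$%^*(+\"']+")


def word_counts(record):
    """Return (word, {url: count}) pairs for one record instead of printing them."""
    doc = record[1]
    summary = PUNCTUATION.sub("", doc["summary"])
    url = doc["url"]
    return [(word, {url: n}) for word, n in Counter(summary.split()).items()]


def merge_by_word(pairs):
    """Group the {url: count} dicts under their word, keeping arrival order."""
    grouped = defaultdict(list)
    for word, url_count in pairs:
        grouped[word].append(url_count)
    return dict(grouped)


# Flatten the per-record lists into one stream of pairs, then group by word.
pairs = [pair for record in records for pair in word_counts(record)]
result = merge_by_word(pairs)

for word in sorted(result):
    print({word: result[word]})
```

In PySpark the same `word_counts` function could presumably be passed to `rdd.flatMap(...)`, followed by `groupByKey()` and `mapValues(list)`, since `foreach` is an action that returns nothing while `flatMap` produces a new RDD. Unlike this local emulation, `groupByKey` does not guarantee the order of the values within each group.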