pyspark: how should I merge my results?

Time: 2016-06-14 09:08:07

Tags: python mongodb apache-spark pyspark rdd

Here is my sample data in MongoDB:

> db.stackin.find({})
{ "_id" : ObjectId("575ce909aa02c3b21f1be0bb"), "summary" : "good good day", "url" : "url_1" }
{ "_id" : ObjectId("575ce909aa02c3b21f1be0bc"), "summary" : "hello world good world", "url" : "url_2" }
{ "_id" : ObjectId("575ce909aa02c3b21f1be0bd"), "summary" : "hello world good hello good", "url" : "url_3" }
{ "_id" : ObjectId("575ce909aa02c3b21f1be0be"), "summary" : "hello world hello", "url" : "url_4" }

What I want is the count of each word per URL.

So for the data above, the result should be:

{"good": [{"url_1": 2}, {"url_2": 1}, {"url_3": 2}]}
{"day": [{"url_1": 1}]}
{"hello": [{"url_2": 1}, {"url_3": 2}, {"url_4": 2}]}
{"world": [{"url_2": 2}, {"url_3": 1}, {"url_4": 1}]}
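
To pin down the target shape, here is a minimal plain-Python sketch (no Spark) that builds the same structure from the four sample documents. The names `docs` and `expected_index` are made up for illustration, and the code assumes Python 3 and whitespace-only tokenisation. It could serve as a reference to compare a Spark result against.

```python
import collections

# Sample documents copied from the question (only the fields that matter here).
docs = [
    {"summary": "good good day", "url": "url_1"},
    {"summary": "hello world good world", "url": "url_2"},
    {"summary": "hello world good hello good", "url": "url_3"},
    {"summary": "hello world hello", "url": "url_4"},
]


def expected_index(documents):
    """Return {word: [{url: count}, ...]}, with URLs in document order."""
    index = collections.defaultdict(list)
    for doc in documents:
        # Count how often each word occurs in this document's summary.
        counts = collections.Counter(doc["summary"].split())
        for word, count in counts.items():
            index[word].append({doc["url"]: count})
    return dict(index)


index = expected_index(docs)
```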

My code:

import pyspark
import re
import collections
import pymongo_spark
pymongo_spark.activate()
rdd = sc.mongoPairRDD("mongodb://localhost/testmr.stackin")

def f(record):
    """Count the words in one record's summary and print (word, {url: count}) pairs."""
    raw_summary = record[1]['summary']
    summary = re.sub("[\.\!\/,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+".decode("utf8"),
                "".decode("utf8"), raw_summary)
    url = record[1]['url']
    _temp = dict(collections.Counter(summary.split()))
    result = [(key,{url:value}) for key,value in _temp.items()]
    print result

rdd.foreach(f)

The RDD data:

>>> rdd.collect()
[(ObjectId('575ce909aa02c3b21f1be0bb'),
  {u'_id': ObjectId('575ce909aa02c3b21f1be0bb'),
   u'summary': u'good good day',
   u'url': u'url_1'}),
 (ObjectId('575ce909aa02c3b21f1be0bc'),
  {u'_id': ObjectId('575ce909aa02c3b21f1be0bc'),
   u'summary': u'hello world good world',
   u'url': u'url_2'}),
 (ObjectId('575ce909aa02c3b21f1be0bd'),
  {u'_id': ObjectId('575ce909aa02c3b21f1be0bd'),
   u'summary': u'hello world good hello good',
   u'url': u'url_3'}),
 (ObjectId('575ce909aa02c3b21f1be0be'),
  {u'_id': ObjectId('575ce909aa02c3b21f1be0be'),
   u'summary': u'hello world hello',
   u'url': u'url_4'})]

It prints this result:

[(u'good', {u'url_1': 2}), (u'day', {u'url_1': 1})]
[(u'world', {u'url_2': 2}), (u'good', {u'url_2': 1}), (u'hello', {u'url_2': 1})]
[(u'world', {u'url_3': 1}), (u'good', {u'url_3': 2}), (u'hello', {u'url_3': 2})]
[(u'world', {u'url_4': 1}), (u'hello', {u'url_4': 2})]

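I think the function f in my code cannot return an RDD instance; it can only print the result.

How should I merge the per-record results into the combined output shown above?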
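
One possible direction, sketched under assumptions rather than tested against this MongoDB setup: `foreach` is an action that runs a function for its side effects and returns nothing, so the printed lists never come back as an RDD. If the function returns its list instead, `flatMap` can flatten the per-record lists into one RDD of (word, {url: count}) pairs, and `groupByKey` followed by `mapValues(list)` can collect them per word. The names `word_url_pairs`, `build_index` and `build_index_locally` are hypothetical, the punctuation regex is a simplified stand-in for the one in the question, and the code assumes Python 3. `build_index` is only defined here, not called, so the block runs without Spark; the local function mirrors the same logic for checking.

```python
import collections
import re


def word_url_pairs(record):
    """Return (word, {url: count}) pairs for one (key, document) record."""
    doc = record[1]
    # Simplified punctuation stripping (an assumption, not the question's regex).
    summary = re.sub(r"[^\w\s]", "", doc["summary"])
    counts = collections.Counter(summary.split())
    return [(word, {doc["url"]: count}) for word, count in counts.items()]


def build_index(rdd):
    """Spark version: flatten the per-record lists, then group them by word."""
    return rdd.flatMap(word_url_pairs).groupByKey().mapValues(list)


def build_index_locally(records):
    """Plain-Python stand-in for build_index, to check the logic without Spark."""
    grouped = collections.defaultdict(list)
    for record in records:
        for word, url_count in word_url_pairs(record):
            grouped[word].append(url_count)
    return dict(grouped)


# Records shaped like the (key, document) pairs in the RDD; keys are placeholders.
records = [
    (1, {"summary": "good good day", "url": "url_1"}),
    (2, {"summary": "hello world good world", "url": "url_2"}),
    (3, {"summary": "hello world good hello good", "url": "url_3"}),
    (4, {"summary": "hello world hello", "url": "url_4"}),
]
local_index = build_index_locally(records)
```

With the RDD from the question, this would be used as `build_index(rdd).collect()`. Spark does not guarantee the order of values within a group, so the per-word lists may come back in a different URL order than the local version produces.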

0 answers:

No answers