Applying a map function on a PySpark RDD

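Time: 2016-10-20 11:05:14

Tags: mongodb python-2.7 pyspark pyspark-sql

I got an RDD by reading a MongoDB collection. Now I want to change some values and update/load the data back into the same collection or a different one.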

mr1 = sc.mongoRDD('mongodb://localhost:27017/test_database.test2')
type(mr1)   #<class 'pyspark.rdd.PipelinedRDD'>
mr1.collect()
#[{u'_id': ObjectId('58089490d7531cd8b071f48c'), u'name': u'ravi', u'sal': u'2000'}, {u'_id': ObjectId('58089491d7531cd8b071f48d'), u'name': u'ravi', u'sal': u'3000'}]
# I want to change the name 'ravi' to 'SATYA'
mr2 = mr1.map(lambda x: x['name'].replace('ravi', 'SATYA'))
# Output: [u'SATYA', u'SATYA']  # only the names, not the whole documents
# Expected: [{u'_id': ObjectId('58089490d7531cd8b071f48c'), u'name': u'SATYA', u'sal': u'2000'}, {u'_id': ObjectId('58089491d7531cd8b071f48d'), u'name': u'SATYA', u'sal': u'3000'}]

Please help: how do I apply the map function here so that I get back the same RDD as mr1, with only the name replaced?

Thanks.

2 answers:

Answer 0 (score: 3)

Try:

def replace(x, key, fr, to):
    d = x.copy()  # copy so the original record is not modified
    if key in d:
        d[key] = d[key].replace(fr, to)  # use the arguments, not hard-coded values
    return d

mr1.map(lambda x: replace(x, 'name', 'ravi','SATYA'))
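
To see why the original attempt returned only names, here is a minimal sketch that runs without Spark. It assumes the records are plain dicts, uses Python's built-in map on a list as a stand-in for mr1, and uses placeholder _id values. The helper name replace_field is hypothetical and simply mirrors the function above.

```python
def replace_field(doc, key, old, new):
    """Return a copy of doc with `old` replaced by `new` in doc[key]."""
    updated = dict(doc)  # shallow copy, so the input record is left untouched
    if key in updated:
        updated[key] = updated[key].replace(old, new)
    return updated

# Stand-in for mr1.collect(); the _id values are placeholders
records = [
    {'_id': 1, 'name': 'ravi', 'sal': '2000'},
    {'_id': 2, 'name': 'ravi', 'sal': '3000'},
]

# Returning only the field: the rest of each record is lost
names_only = list(map(lambda d: d['name'].replace('ravi', 'SATYA'), records))

# Returning the whole (copied) record keeps every field
full_records = list(map(lambda d: replace_field(d, 'name', 'ravi', 'SATYA'), records))
```

The function passed to map decides what each element of the result looks like, so it has to return the full record rather than just the changed field.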

Answer 1 (score: 2)

Got it working:

def rep(x):
    # Note: this modifies the record in place before returning it
    if x['name'] == 'ravi':
        x['name'] = 'SATYA'
    return x

mr2 = mr1.map(rep)
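
Note that rep matches the whole name exactly, whereas str.replace in the other answer also rewrites substrings (for example 'ravikumar' would become 'SATYAkumar'). A possible variant, sketched here under the assumption that records are plain dicts with string keys, returns a new dict instead of modifying the input and tolerates documents without a 'name' field. The name rep_safe is hypothetical.

```python
def rep_safe(doc):
    # .get avoids a KeyError when a document has no 'name' field
    if doc.get('name') == 'ravi':
        # dict(doc, name=...) builds a new dict; doc itself is not modified
        return dict(doc, name='SATYA')
    return doc

# Local stand-in for the RDD contents; _id values are placeholders
docs = [
    {'_id': 1, 'name': 'ravi', 'sal': '2000'},
    {'_id': 2, 'name': 'ravikumar', 'sal': '3000'},
    {'_id': 3, 'sal': '4000'},
]
result = [rep_safe(d) for d in docs]
```

With an RDD the same function could be passed directly, as in mr1.map(rep_safe).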