PySpark | Map a JSON RDD and apply a broadcast variable

Date: 2018-12-05 14:57:23

Tags: pyspark rdd

In PySpark, how do I transform an input RDD of JSON strings into the output specified below, applying a broadcast variable to each record's list of values?

Input

[{'id': 1, 'title': "Foo", 'items': ['a','b','c']}, {'id': 2, 'title': "Bar", 'items': ['a','b','d']}]

Broadcast variable

[('a', 5), ('b', 12), ('c', 42), ('d', 29)]

Desired output

[(1, 'Foo', [5, 12, 42]), (2, 'Bar', [5, 12, 29])]

1 Answer:

Answer 0 (score: 1):

Edit: Originally I was under the impression that a function passed to map would be broadcast automatically, but after reading some documentation I am no longer sure.
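For what it's worth, a plain local variable captured in the mapper's closure is serialized and shipped to the executors with each task, so a lookup works even without sc.broadcast; the advantage of an explicit broadcast variable is that it is cached once per executor rather than re-sent with every task. A minimal sketch (mine, not part of the original answer):

# A plain dict captured in the closure is serialized with each task.
lookup = {'a': 5, 'b': 12, 'c': 42, 'd': 29}
print(sc.parallelize(['a', 'd']).map(lambda x: lookup[x]).collect())
# [5, 29]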

In any case, you can define a broadcast variable:

bv = [('a', 5), ('b', 12), ('c', 42), ('d', 29)]

# turn into a dictionary
bv = dict(bv)
broadcastVar = sc.broadcast(bv)
print(broadcastVar.value)
#{'a': 5, 'c': 42, 'b': 12, 'd': 29}

The dictionary is now available on all machines as a read-only variable, and you can access it through broadcastVar.value.

For example:

import json

rdd = sc.parallelize(
    [
        '{"id": 1, "title": "Foo", "items": ["a","b","c"]}',
        '{"id": 2, "title": "Bar", "items": ["a","b","d"]}'
    ]
)

def myMapper(row):
    # define the order of the values for your output
    key_order = ["id", "title", "items"]

    # load the json string into a dict
    d = json.loads(row)

    # replace the items using the broadcast variable dict
    d["items"] = [broadcastVar.value.get(item) for item in d["items"]]

    # return the values in order
    return tuple(d[k] for k in key_order)

print(rdd.map(myMapper).collect())
#[(1, u'Foo', [5, 12, 42]), (2, u'Bar', [5, 12, 29])]
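
One caveat worth noting (my addition, not part of the original answer): dict.get(item) returns None for any item that has no entry in the broadcast dict. If you would rather keep an unmapped item as-is, pass it as the default; the line below is a drop-in replacement for the corresponding line in myMapper:

# Fall back to the original item when the broadcast dict has no mapping
# for it, instead of inserting None.
d["items"] = [broadcastVar.value.get(item, item) for item in d["items"]]

Once the job no longer needs the variable, broadcastVar.unpersist() releases the cached copies on the executors.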