I have a DataFrame with a column in which each row holds a list of dictionaries, stored as a string:
[
Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'},{...}]"),
Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'},{...}]")
]
How can I parse it into a DataFrame structured like this:
key1 | key2 | key3 | keyN |
value1|value2|value3|valueN|
value1|value2|value3|valueN|
Answer 0 (score: 0)
You can do it as follows:
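from pyspark.sql import Row
# Assumes sc and sqlContext already exist, as they do in the PySpark shell.
l = [Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"),
     Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]")]
# convert the list of Rows to an RDD:
ll = sc.parallelize(l)
# Merge each row's list of dicts into one dict, then let Spark infer the schema.
# Note: eval() executes the string as Python code, so use it only on trusted data.
df = sqlContext.read.json(ll.map(lambda r: dict(
    kv for d in eval(r.payload) for kv in d.items())))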
Explanation:
I think the only part that may be unclear is the following intermediate expression:
dict(kv for d in eval(r.payload) for kv in d.items())
which converts this format
[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]
into this one:
{'key3': 'value3', 'key2': 'value2', 'key1': 'value1'}
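The flattening step can be tried on its own, without Spark. The following is a minimal sketch rather than the answer's exact code: it swaps `eval` for the standard library's `ast.literal_eval`, which only accepts Python literals, and the helper name `flatten_payload` is hypothetical.

```python
import ast

def flatten_payload(payload):
    """Merge a string-encoded list of dicts into a single dict."""
    # literal_eval parses Python literals only, so it cannot run arbitrary code.
    dicts = ast.literal_eval(payload)
    # Same generator expression as in the answer: collect every (key, value) pair.
    return dict(kv for d in dicts for kv in d.items())

payload = u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"
flat = flatten_payload(payload)
print(flat)
```

If the same key appears in more than one dictionary of the list, the last occurrence wins, because `dict()` keeps the most recent value for a repeated key.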
Output:
>>> df
DataFrame[key1: string, key2: string, key3: string]
>>> df.show()
+------+------+------+
| key1| key2| key3|
+------+------+------+
|value1|value2|value3|
|value1|value2|value3|
+------+------+------+
Answer 1 (score: -1)
To get the expected DataFrame structure (as a pandas DataFrame):
import pandas as pd
from pyspark.sql import Row
dataframe = [
Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"),
Row(payload=u"[{'key1':'value4'},{'key2':'value5'},{'key3':'value6'}]")]
new_data = [eval(row['payload']) for row in dataframe]
# [[{'key1': 'value1'}, {'key2': 'value2'}, {'key3': 'value3'}], [{'key1': 'value4'}, {'key2': 'value5'}, {'key3': 'value6'}]]
data_list = []
for sub_list in new_data:
    dict_list = {}
    for dict_val in sub_list:
        dict_list.update(dict_val)
    data_list.append(dict_list)
# [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}, {'key1': 'value4', 'key2': 'value5', 'key3': 'value6'}]
df = pd.DataFrame(data_list)
# key1 key2 key3
# 0 value1 value2 value3
# 1 value4 value5 value6
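The question leaves the set of keys open (`keyN`), so payloads may not all carry the same keys. The sketch below shows one way to align such rows using only the standard library; the function name `payloads_to_table` and the choice of `None` for missing values are assumptions made for illustration, and `ast.literal_eval` is used in place of `eval`.

```python
import ast

def payloads_to_table(payloads):
    """Return (columns, rows) for a list of string-encoded lists of dicts."""
    merged = []
    for payload in payloads:
        row = {}
        for d in ast.literal_eval(payload):
            row.update(d)  # later dicts overwrite earlier ones on key clashes
        merged.append(row)
    # Column order follows the first appearance of each key across all rows.
    columns = []
    for row in merged:
        for key in row:
            if key not in columns:
                columns.append(key)
    # Missing keys become None so that every row has the same width.
    rows = [[row.get(col) for col in columns] for row in merged]
    return columns, rows

columns, rows = payloads_to_table([
    "[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]",
    "[{'key1':'value4'},{'key3':'value6'}]",
])
print(columns)
print(rows)
```

Passing a list of dicts with differing keys to `pd.DataFrame` gives a comparable result, with `NaN` in place of the missing values.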