我在kafka流媒体之后得到了这个RDD。我想在不定义Schema的情况下将其转换为数据帧。
[
{u'Enrolment_Date': u'2008-01-01', u'Freq': 78},
{u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'},
{u'Freq': 70, u'Group': u'Recorded Data'},
{u'Enrolment_Date': u'2008-04-01', u'Freq': 96}
]
答案 0 :(得分:0)
您可以使用OrderedDict
将包含键值对的RDD转换为数据帧。但是,在您的情况下,并非所有键都存在于每一行中,因此您需要首先使用None
值填充这些键。请参阅下面的解决方案
#Define test data
data = [{u'Enrolment_Date': u'2008-01-01', u'Freq': 78}, {u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'}, {u'Freq': 70, u'Group': u'Recorded Data'}, {u'Enrolment_Date': u'2008-04-01', u'Freq': 96}]
rdd = sc.parallelize(data)
from pyspark.sql import Row
from collections import OrderedDict
#Determine all the keys in the input data
schema = rdd.flatMap(lambda x: x.keys()).distinct().collect()
#Add missing keys with a None value
rdd_complete= rdd.map(lambda r:{x:r.get(x) for x in schema})
#Use an OrderedDict to convert your data to a dataframe
#This ensures data ends up in the right column
df = rdd_complete.map(lambda r: Row(**OrderedDict(sorted(r.items())))).toDF()
df.show()
这给出了输出:
+--------------+----+-------------+
|Enrolment_Date|Freq| Group|
+--------------+----+-------------+
| 2008-01-01| 78| null|
| 2008-02-01|null|Recorded Data|
| null| 70|Recorded Data|
| 2008-04-01| 96| null|
+--------------+----+-------------+