How to convert an unstructured RDD to a DataFrame without defining a schema in PySpark?

Asked: 2017-03-24 06:52:34

Tags: pyspark

I got this RDD from a Kafka stream. I want to convert it to a DataFrame without defining a schema.

[
 {u'Enrolment_Date': u'2008-01-01', u'Freq': 78}, 
 {u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'}, 
 {u'Freq': 70, u'Group': u'Recorded Data'}, 
 {u'Enrolment_Date': u'2008-04-01', u'Freq': 96}
 ]

1 Answer:

Answer 0 (score: 0)

You can convert an RDD of key-value pairs to a DataFrame using an OrderedDict. However, in your case not every key is present in every row, so you first need to fill in the missing keys with None values. See the solution below:

# Define test data
data = [
    {u'Enrolment_Date': u'2008-01-01', u'Freq': 78},
    {u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'},
    {u'Freq': 70, u'Group': u'Recorded Data'},
    {u'Enrolment_Date': u'2008-04-01', u'Freq': 96},
]
rdd = sc.parallelize(data)

from pyspark.sql import Row 
from collections import OrderedDict

# Determine all the keys that occur anywhere in the input data
schema = rdd.flatMap(lambda x: x.keys()).distinct().collect()
# Add missing keys with a None value
rdd_complete = rdd.map(lambda r: {x: r.get(x) for x in schema})

# Use an OrderedDict to convert your data to a DataFrame.
# Sorting the keys ensures every value ends up in the right column.
df = rdd_complete.map(lambda r: Row(**OrderedDict(sorted(r.items())))).toDF()
df.show()

This gives the output:

+--------------+----+-------------+
|Enrolment_Date|Freq|        Group|
+--------------+----+-------------+
|    2008-01-01|  78|         null|
|    2008-02-01|null|Recorded Data|
|          null|  70|Recorded Data|
|    2008-04-01|  96|         null|
+--------------+----+-------------+