Convert nested dictionary key-values into a PySpark dataframe

Asked: 2020-07-20 22:15:08

Tags: python json dictionary pyspark

I have a PySpark dataframe that looks like this:

[screenshot of the input dataframe]

I want to extract the nested dictionaries in the "dic" column and convert them into a PySpark dataframe, like this:

[screenshot of the desired output]

Please let me know how to achieve this.

Thanks!

1 Answer:

Answer 0 (score: 2)

from pyspark.sql import functions as F

df.show() #sample dataframe

+---------+----------------------------------------------------------------------------------------------------------+
|timestmap|dic                                                                                                       |
+---------+----------------------------------------------------------------------------------------------------------+
|timestamp|{"Name":"David","Age":"25","Location":"New York","Height":"170","fields":{"Color":"Blue","Shape":"round"}}|
+---------+----------------------------------------------------------------------------------------------------------+

For Spark 2.4+, you can use from_json with schema_of_json:

# infer the JSON schema from the first row's string (this triggers an extra job)
schema = df.select(F.schema_of_json(df.select("dic").first()[0])).first()[0]


df.withColumn("dic", F.from_json("dic", schema))\
  .selectExpr("dic.*").selectExpr("*","fields.*").drop("fields").show()

#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25|   170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+
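To make the flattening step concrete, here is a plain-Python sketch (not PySpark) of what `from_json` followed by `selectExpr("*", "fields.*")` and `.drop("fields")` computes: parse the JSON string, then hoist the nested `fields` keys up one level. The sample string is copied from the dataframe above; the helper name `flatten_dic` is just illustrative.

```python
import json

def flatten_dic(raw):
    """Parse a JSON string and hoist the nested 'fields' keys to the top level."""
    row = json.loads(raw)
    fields = row.pop("fields", {})  # remove the nested struct, like .drop("fields")
    row.update(fields)              # splat its keys, like selectExpr("*", "fields.*")
    return row

raw = '{"Name":"David","Age":"25","Location":"New York","Height":"170","fields":{"Color":"Blue","Shape":"round"}}'
print(flatten_dic(raw))
# {'Name': 'David', 'Age': '25', 'Location': 'New York', 'Height': '170', 'Color': 'Blue', 'Shape': 'round'}
```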

If you don't have Spark 2.4, you can use read.json with an rdd. Note that the df-to-rdd conversion will have a performance impact.

df1 = spark.read.json(df.rdd.map(lambda r: r.dic))
df1.select(*[x for x in df1.columns if x!='fields'], F.col("fields.*")).show()

#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25|   170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+