How to convert/explode a dict column of a PySpark DataFrame into rows

Asked: 2021-07-19 09:58:37

Tags: python json apache-spark dictionary pyspark

I have a JSON file with the following structure:

{
    "name": {
        "0": "name1",
        "1": "name2",
        "2": "name3"
    },
    "id": {
        "0": "00001",
        "1": "00002",
        "2": "00013"
    }
}

When I read this JSON file into a Spark DataFrame (using Python), I get a DataFrame where each column holds the whole dictionary:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("id", StringType(), True),
])
spark_df = spark.read.schema(schema).json('path_to_json_file', multiLine=True)
spark_df.show(truncate=False)
+-------------------------------------+-------------------------------------+
|      name                           |           id                        |
+-------------------------------------+-------------------------------------+
|{"0":"name1","1":"name2","2":"name3"}|{"0":"00001","1":"00002","2":"00013"}|
+-------------------------------------+-------------------------------------+
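For reference, printing the schema confirms that with this schema both columns come back as plain strings, which is what causes the explode error further down:

spark_df.printSchema()
root
 |-- name: string (nullable = true)
 |-- id: string (nullable = true)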

How can I explode each column so that it contains only the values:

+------+-----+
| name | id  |
+------+-----+
|name1 |00001|
+------+-----+
|name2 |00002|
+------+-----+
|name3 |00013|
+------+-----+

I tried using the explode function, but got an error:

from pyspark.sql import functions as f
spark_df.select('*', f.explode('id').alias('id')).show()

raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "cannot resolve 'explode(`id`)' due to data type mismatch: input to function explode should be array or map type, not string;;\n'Project [name#860, id#861, explode(id#860) AS id#947]\n+- Relation[name#860,id#861] json\n"

I also tried the from_json function, but for that I would have to define the inner schema, which I can't do because the number of values is unknown. I tried this schema (applied roughly as sketched below), but only received null values:

schema = StructType([StructField('key1', StringType(), True)])
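Roughly how I applied it (a sketch of the attempt, reconstructed from the snippets above; the exact call may have differed):

from pyspark.sql import functions as f

# from_json looks for a 'key1' field inside {"0": ..., "1": ..., "2": ...},
# finds none, and so returns a struct holding only nulls.
spark_df.withColumn('name', f.from_json('name', schema)).show()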

Basically, all I know are the top-level key names (which should become the field names); the number of records I will get is unknown.

2 Answers:

Answer 0 (score: 1)

First of all, your input schema is wrong. Change it to use MapType:

from pyspark.sql import types as T

schm = T.StructType(
    [
        T.StructField("name", T.MapType(T.StringType(), T.StringType()), True),
        T.StructField("id", T.MapType(T.StringType(), T.StringType()), True),
    ]
)
df = spark.read.schema(schm).json("path_to_json_file", multiLine=True)

df.printSchema()
root
 |-- name: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- id: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

Then, assuming name and id have the same number of entries (a quick check for this assumption follows the output below):

df.withColumn("key", F.explode(F.map_keys("json.name"))).select(
    F.col("json.name").getItem(F.col("key")).alias("name"),
    F.col("json.id").getItem(F.col("key")).alias("id"),
).show()

+-----+-----+
| name|   id|
+-----+-----+
|name1|00001|
|name2|00002|
|name3|00013|
+-----+-----+
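If you want to verify that assumption first, one quick sanity check (a sketch) is to compare the sizes of the two maps:

# Rows where the two maps have a different number of entries -- expect 0:
df.filter(F.size("name") != F.size("id")).count()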

Answer 1 (score: 0)

Thanks @Steven and @Alex Ott! Based on your suggestions, here is what worked for me:

from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.StructType(
    [
        T.StructField("name", T.MapType(T.StringType(), T.StringType()), True),
        T.StructField("id", T.MapType(T.StringType(), T.StringType()), True),
    ]
)
spark_df = spark.read.schema(schema).json("path_to_json_file", multiLine=True)

# Zip the two value arrays and explode them together: exploding "name" and
# "id" in two separate withColumn calls would cross-join the values (3 x 3 =
# 9 rows) instead of pairing them up. This relies on both maps listing their
# keys in the same order, which holds here since they come from the same JSON.
spark_df.select(
    F.explode(F.arrays_zip(F.map_values("name"), F.map_values("id"))).alias("pair")
).select(
    F.col("pair")["0"].alias("name"),
    F.col("pair")["1"].alias("id"),
).show()

+-----+-----+
| name|   id|
+-----+-----+
|name1|00001|
|name2|00002|
|name3|00013|
+-----+-----+