With mixed value types

Asked: 2017-09-02 05:36:28

Tags: pyspark pyspark-sql

I have JSON input like this:


    {
      "1": {
        "id": 1,
        "value": 5
      },
      "2": {
        "id": 2,
        "list": {
          "10": {
            "id": 10
          },
          "11": {
            "id": 11
          },
          "20": {
            "id": 20
          }
        }
      },
      "3": {
        "id": 3,
        "key": "a"
      }
    }

I need to merge the 3 columns and extract the value I need from each, so this is the output I need:


    {
      "out": {
        "1": 5,
        "2": [10, 11, 20],
        "3": "a"
      }
    }
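Roughly, in plain Python over the parsed dict (no Spark), the mapping I am after is:

```python
import json

data = json.loads("""{
  "1": {"id": 1, "value": 5},
  "2": {"id": 2, "list": {"10": {"id": 10}, "11": {"id": 11}, "20": {"id": 20}}},
  "3": {"id": 3, "key": "a"}
}""")

out = {
    "1": data["1"]["value"],                                   # plain integer
    "2": sorted(v["id"] for v in data["2"]["list"].values()),  # list of nested ids
    "3": data["3"]["key"],                                     # string
}
# out == {"1": 5, "2": [10, 11, 20], "3": "a"}
```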

I tried to create a UDF to combine these 3 columns into 1, but I can't figure out how to define a MapType() with mixed value types: IntegerType(), ArrayType(IntegerType()), and StringType(), respectively.

Thanks in advance!

2 answers:

Answer 0 (score: 2)

You need to define the result type of your UDF with StructType, not MapType, like this:

    from pyspark.sql.types import *

    udf_result = StructType([
        StructField('1', IntegerType()),
        StructField('2', ArrayType(StringType())),
        StructField('3', StringType())
    ])
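A minimal sketch of what the UDF body could look like against this result type (hypothetical helper `merge_cols`; plain Python, so the logic can be followed without a Spark session). The function returns a tuple aligned with the three StructFields — note that '2' is declared ArrayType(StringType()), so the nested keys come back as strings here:

```python
# Hypothetical UDF body matching the StructType above: returns an
# (int, list[str], str) tuple aligned with fields '1', '2', '3'.
def merge_cols(col1, col2, col3):
    return (
        col1["value"],                # IntegerType
        sorted(col2["list"].keys()),  # ArrayType(StringType()): keys as strings
        col3["key"],                  # StringType
    )

# In Spark this would be registered along the lines of:
#   merge_udf = psf.udf(merge_cols, udf_result)
```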

Answer 1 (score: 1)

MapType() is for (key, value) pair definitions, not for nested dataframes. What you are looking for is StructType().

You could load it directly with createDataFrame, but then you would have to pass a schema, so this way is easier:

    import json

    data_json = {
        "1": {
            "id": 1,
            "value": 5
        },
        "2": {
            "id": 2,
            "list": {
                "10": {
                    "id": 10
                },
                "11": {
                    "id": 11
                },
                "20": {
                    "id": 20
                }
            }
        },
        "3": {
            "id": 3,
            "key": "a"
        }
    }
    a = [json.dumps(data_json)]
    jsonRDD = sc.parallelize(a)
    df = spark.read.json(jsonRDD)
    df.printSchema()

    root
     |-- 1: struct (nullable = true)
     |    |-- id: long (nullable = true)
     |    |-- value: long (nullable = true)
     |-- 2: struct (nullable = true)
     |    |-- id: long (nullable = true)
     |    |-- list: struct (nullable = true)
     |    |    |-- 10: struct (nullable = true)
     |    |    |    |-- id: long (nullable = true)
     |    |    |-- 11: struct (nullable = true)
     |    |    |    |-- id: long (nullable = true)
     |    |    |-- 20: struct (nullable = true)
     |    |    |    |-- id: long (nullable = true)
     |-- 3: struct (nullable = true)
     |    |-- id: long (nullable = true)
     |    |-- key: string (nullable = true)

Now let's access the nested dataframes. Note that column "2" is more deeply nested than the others:

    nested_cols = ["2"]
    cols = ["1", "3"]
    import pyspark.sql.functions as psf
    df = df.select(
        cols + [psf.array(psf.col(c + ".list.*")).alias(c) for c in nested_cols]
    )
    df = df.select(
        [df[c].id.alias(c) for c in df.columns]
    )

    root
     |-- 1: long (nullable = true)
     |-- 3: long (nullable = true)
     |-- 2: array (nullable = false)
     |    |-- element: long (containsNull = true)
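The two select steps above can be mirrored in plain Python (a dict-based sketch, not Spark code) to make the transformation concrete: first collect the structs nested under `2.list` into an array, then keep only each struct's `id`:

```python
row = {
    "1": {"id": 1, "value": 5},
    "2": {"id": 2, "list": {"10": {"id": 10}, "11": {"id": 11}, "20": {"id": 20}}},
    "3": {"id": 3, "key": "a"},
}

# Step 1: turn column "2" into an array of its nested structs.
step1 = {"1": row["1"], "3": row["3"],
         "2": [row["2"]["list"][k] for k in sorted(row["2"]["list"])]}

# Step 2: keep only the id of each struct (a list of ids for "2").
step2 = {c: ([e["id"] for e in v] if isinstance(v, list) else v["id"])
         for c, v in step1.items()}
# step2 == {"1": 1, "3": 3, "2": [10, 11, 20]}
```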

Since you want it nested inside an "out" column, this is not exactly your final output yet:

    import pyspark.sql.functions as psf
    df.select(psf.struct("*").alias("out")).printSchema()

    root
     |-- out: struct (nullable = false)
     |    |-- 1: long (nullable = true)
     |    |-- 3: long (nullable = true)
     |    |-- 2: array (nullable = false)
     |    |    |-- element: long (containsNull = true)

And finally, back to JSON:

    df.toJSON().first()

    '{"1":1,"3":3,"2":[10,11,20]}'