I have JSON input like this:

```json
{
  "1": { "id": 1, "value": 5 },
  "2": { "id": 2, "list": { "10": { "id": 10 }, "11": { "id": 11 }, "20": { "id": 20 } } },
  "3": { "id": 3, "key": "a" }
}
```

I need to merge the 3 columns, extracting the needed value from each, so that the output is:

```json
{ "out": { "1": 5, "2": [10, 11, 20], "3": "a" } }
```

I tried creating a UDF to convert these 3 columns into 1, but I can't figure out how to define a `MapType()` with mixed value types — `IntegerType()`, `ArrayType(IntegerType())` and `StringType()` respectively.

Thanks in advance!
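For reference, outside of Spark the transformation being asked for is just the following plain-Python mapping over the raw dict (illustrative only — the actual question is how to do this on DataFrame columns):

```python
import json

data = {
    "1": {"id": 1, "value": 5},
    "2": {"id": 2, "list": {"10": {"id": 10}, "11": {"id": 11}, "20": {"id": 20}}},
    "3": {"id": 3, "key": "a"},
}

# Pick a different field from each top-level entry:
# "1" -> its "value", "2" -> the ids inside "list", "3" -> its "key".
out = {"out": {
    "1": data["1"]["value"],
    "2": [v["id"] for v in data["2"]["list"].values()],
    "3": data["3"]["key"],
}}

print(json.dumps(out))  # {"out": {"1": 5, "2": [10, 11, 20], "3": "a"}}
```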
Answer 0 (score: 2)

You need to define the result type of the UDF with a `StructType`, not a `MapType`, like this:

```python
from pyspark.sql.types import *

udf_result = StructType([
    StructField('1', IntegerType()),
    StructField('2', ArrayType(IntegerType())),
    StructField('3', StringType())
])
```
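A UDF with this return type wraps an ordinary Python function that returns a tuple whose elements match the `StructType` fields in order. A minimal sketch of such a function, shown here on plain dicts for illustration (inside Spark the arguments would be `Row` objects, and the function name `merge_cols` is hypothetical):

```python
def merge_cols(c1, c2, c3):
    """Return a tuple matching udf_result's fields ('1', '2', '3')."""
    return (
        c1["value"],                             # IntegerType()
        [v["id"] for v in c2["list"].values()],  # ArrayType(IntegerType())
        c3["key"],                               # StringType()
    )

result = merge_cols(
    {"id": 1, "value": 5},
    {"id": 2, "list": {"10": {"id": 10}, "11": {"id": 11}, "20": {"id": 20}}},
    {"id": 3, "key": "a"},
)
print(result)  # (5, [10, 11, 20], 'a')
```

You would then register it with `psf.udf(merge_cols, udf_result)` and apply it to the three columns.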
Answer 1 (score: 1)

`MapType()` is used to define (key, value) pairs, not nested data frames. What you are looking for is `StructType()`.

You could load the data directly with `createDataFrame`, but you would have to pass a schema, so this way is easier:
```python
import json

data_json = {
    "1": {
        "id": 1,
        "value": 5
    },
    "2": {
        "id": 2,
        "list": {
            "10": {"id": 10},
            "11": {"id": 11},
            "20": {"id": 20}
        }
    },
    "3": {
        "id": 3,
        "key": "a"
    }
}
a = [json.dumps(data_json)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
df.printSchema()
```
```
root
 |-- 1: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- value: long (nullable = true)
 |-- 2: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- list: struct (nullable = true)
 |    |    |-- 10: struct (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |-- 11: struct (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |-- 20: struct (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |-- 3: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- key: string (nullable = true)
```
Now we access the nested data frames. Note that column "2" is more deeply nested than the others:
```python
import pyspark.sql.functions as psf

nested_cols = ["2"]
cols = ["1", "3"]
df = df.select(
    cols + [psf.array(psf.col(c + ".list.*")).alias(c) for c in nested_cols]
)
df = df.select(
    [df[c].id.alias(c) for c in df.columns]
)
```
```
root
 |-- 1: long (nullable = true)
 |-- 3: long (nullable = true)
 |-- 2: array (nullable = false)
 |    |-- element: long (containsNull = true)
```
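In plain-Python terms, what these two selects do to column "2" is roughly the following (a dict-based sketch for illustration, not Spark code):

```python
nested = {"10": {"id": 10}, "11": {"id": 11}, "20": {"id": 20}}

# psf.array(psf.col("2.list.*")) gathers the nested structs into one array:
arr = list(nested.values())

# df["2"].id then maps the field access over that array, one id per struct:
ids = [s["id"] for s in arr]
print(ids)  # [10, 11, 20]
```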
Since you want it nested in an "out" column, this is not quite your final output yet:
```python
import pyspark.sql.functions as psf

df.select(psf.struct("*").alias("out")).printSchema()
```

```
root
 |-- out: struct (nullable = false)
 |    |-- 1: long (nullable = true)
 |    |-- 3: long (nullable = true)
 |    |-- 2: array (nullable = false)
 |    |    |-- element: long (containsNull = true)
```
Finally, back to JSON:

```python
df.toJSON().first()
```

```
'{"1":1,"3":3,"2":[10,11,20]}'
```

Note that this extracts the nested `id` for every column; to get `5` and `"a"` for columns "1" and "3" as in your expected output, select `df["1"]["value"]` and `df["3"]["key"]` instead of `.id` in the second select above.