I am trying to extract certain parameters from a nested JSON (with a dynamic schema) and generate a Spark DataFrame using PySpark.
My code works perfectly for level-1 (key:value) pairs, but fails to get independent columns for each (key:value) pair of the nested JSON.
NOTE - this is not the exact schema; it is only meant to convey its nested nature:
{
  "tweet": {
    "text": "RT @author original message",
    "user": {
      "screen_name": "Retweeter"
    },
    "retweeted_status": {
      "text": "original message",
      "user": {
        "screen_name": "OriginalTweeter"
      },
      "place": {
      },
      "entities": {
      },
      "extended_entities": {
      }
    },
    "entities": {
    },
    "extended_entities": {
    }
  }
}
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
StructField("text", StringType(), True),
StructField("created_at", StringType(), True),
StructField("retweeted_status", StructType([
StructField("text", StringType(), True),
StructField("created_at", StringType(), True)]))
])
df = spark.read.schema(schema).json("/user/sagarp/NaMo/data/NaMo2019-02-12_00H.json")
df.show()
All the (key:value) pairs under the nested retweeted_status JSON get squashed into one single list column, e.g. [text, created_at, entities]:
+--------------------+--------------------+--------------------+
| text| created_at| retweeted_status|
+--------------------+--------------------+--------------------+
|RT @Hoosier602: @...|Mon Feb 11 19:04:...|[@CLeroyjnr @Gabr...|
|RT @EgSophie: Oh ...|Mon Feb 11 19:04:...|[Oh cool so do yo...|
|RT @JacobAWohl: @...|Mon Feb 11 19:04:...|[@realDonaldTrump...|
I want an independent column for each key. Also, note that the parent level already has a key with the same name, text. How would you handle such instances?
Ideally, I would want columns like "text", "entities", "retweeted_status_text", "retweeted_status_entities", etc.
Answer 0 (score: 1)
Your schema is not mapped correctly. If you want to build the schema manually (recommended if the data does not change), refer to the posts below:
PySpark: How to Update Nested Columns?
https://docs.databricks.com/_static/notebooks/complex-nested-structured.html
Additionally, if your JSON spans multiple lines (as in your example), you can let Spark infer the schema with the multiline option:
! cat nested.json
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
{
"string": "string3",
"int": 3,
"array": [
3,
6,
9
],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]
getSchema = spark.read.option("multiline", "true").json("nested.json")
extractSchema = getSchema.schema
print(extractSchema)
StructType(List(StructField(array,ArrayType(LongType,true),true),StructField(dict,StructType(List(StructField(extra_key,StringType,true),StructField(key,StringType,true))),true),StructField(int,LongType,true),StructField(string,StringType,true)))
loadJson = spark.read.option("multiline", "true").schema(extractSchema).json("nested.json")
loadJson.printSchema()
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
loadJson.show(truncate=False)
+---------+----------------------+---+-------+
|array |dict |int|string |
+---------+----------------------+---+-------+
|[1, 2, 3]|[, value1] |1 |string1|
|[2, 4, 6]|[, value2] |2 |string2|
|[3, 6, 9]|[extra_value3, value3]|3 |string3|
+---------+----------------------+---+-------+
Once you have the data loaded with the correct mapping, you can then start transforming toward a normalized schema via the "dot" notation for nested columns, "explode" for flattening arrays, etc.
loadJson\
.selectExpr("dict.key as key", "dict.extra_key as extra_key").show()
+------+------------+
| key| extra_key|
+------+------------+
|value1| null|
|value2| null|
|value3|extra_value3|
+------+------------+