I am using PySpark to read the following JSON file:
{
  "data": {
    "indicatr": {
      "indicatr": {
        "id": "5c9e41e4884db700desdaad8"
      }
    }
  }
}
I wrote the following Python code:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("data", StructType([
        StructField("indicatr", StructType([
            StructField("indicatr", StructType([
                StructField("id", StringType())
            ]))
        ]))
    ]))
])
df = spark.read.json("pathtofile/test.json", multiLine=True)
df.show()
df2 = df.withColumn("json", F.col("data").cast("string"))
df3 = df2.select(F.col("json"))
df3.collect()
df4 = df3.select(F.from_json(F.col("json"), schema).alias("name"))
df4.show()
I get the following result:
+----+
|name|
+----+
|null|
+----+
Does anyone know how to solve this problem?
Answer 0 (score: 0)
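When you select the column named json, you are selecting a column that is purely StringType (logically, since you cast it to that type). Casting a struct to a string does not serialize it as JSON, so even if the value resembles a JSON object it is just a plain string; from_json cannot parse it with the given schema and returns null. df2.data, on the other hand, does not have that problem: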
In [2]: df2.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- indicatr: struct (nullable = true)
 |    |    |-- indicatr: struct (nullable = true)
 |    |    |    |-- id: string (nullable = true)
 |-- json: string (nullable = true)
By the way, you can pass the schema directly when reading the file:
In [3]: df = spark.read.json("data.json", multiLine=True, schema=schema)
   ...: df.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- indicatr: struct (nullable = true)
 |    |    |-- indicatr: struct (nullable = true)
 |    |    |    |-- id: string (nullable = true)
You can then drill into the column to get the nested value:
In [4]: df.select(df.data.indicatr.indicatr.id).show()
+-------------------------+
|data.indicatr.indicatr.id|
+-------------------------+
| 5c9e41e4884db700desdaad8|
+-------------------------+