Parsing JSON data with pyspark

Date: 2019-11-28 15:54:10

Tags: json pyspark

I am reading the following JSON file with pyspark:

{
  "data": {
    "indicatr": {
      "indicatr": {
        "id": "5c9e41e4884db700desdaad8"
      }
    }
  }
}

I wrote the following Python code:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("data", StructType([
        StructField("indicatr", StructType([
            StructField("indicatr", StructType([
                StructField("id", StringType())
            ]))
        ]))
    ]))
])

df = spark.read.json("pathtofile/test.json", multiLine=True)
df.show()

df2 = df.withColumn("json", F.col("data").cast("string"))

df3 = df2.select(F.col("json"))
df3.collect()

df4 = df3.select(F.from_json(F.col("json"), schema).alias("name"))
df4.show()

I get the following result:

+----+
|name|
+----+
|null|
+----+

Does anyone know how to solve this?

1 answer:

Answer 0 (score: 0)

When you select the column labeled json, you are selecting a column that is entirely StringType (logically so, since you cast it to that type). Although it may look like a valid JSON object, it is in fact just a string. df2.data does not have that problem:

In [2]: df2.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- indicatr: struct (nullable = true)
 |    |    |-- indicatr: struct (nullable = true)
 |    |    |    |-- id: double (nullable = true)
 |-- json: string (nullable = true)
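
If you do want to keep the round trip through a string column, the sketch below (assuming the df and schema defined in the question) uses to_json instead of cast: casting a struct to string gives Spark's own rendering of the struct rather than JSON, which is likely why from_json returned null.

from pyspark.sql import functions as F

# to_json produces a real JSON string; cast("string") does not
df2 = df.withColumn("json", F.to_json(F.col("data")))

# the string now holds only the contents of "data", so parse it with the
# matching sub-schema rather than the full schema
df4 = df2.select(F.from_json(F.col("json"), schema["data"].dataType).alias("name"))
df4.show(truncate=False)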

Incidentally, you can pass the schema right away at read time:

In [3]: df = spark.read.json("data.json", multiLine=True, schema=schema)
   ...: df.printSchema()
   ...: 
   ...: 
root
 |-- data: struct (nullable = true)
 |    |-- indicatr: struct (nullable = true)
 |    |    |-- indicatr: struct (nullable = true)
 |    |    |    |-- id: string (nullable = true)

You can dig into the column to get the nested values:

In [4]: df.select(df.data.indicatr.indicatr.id).show()
+-------------------------+
|data.indicatr.indicatr.id|
+-------------------------+
| 5c9e41e4884db700desdaad8|
+-------------------------+
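
As a small follow-up sketch (same df as above; the alias name "id" is just chosen here for readability), the nested field can be given a shorter column name:

from pyspark.sql import functions as F

# select the nested field and rename it with a short alias
df.select(F.col("data.indicatr.indicatr.id").alias("id")).show(truncate=False)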