How to convert a JSON string to a JSON object in pyspark

Time: 2018-04-11 10:26:22

Tags: json pyspark spark-dataframe pyspark-sql

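I have a DataFrame whose only column is of type string, but it actually contains JSON objects following 4 different schemas, with only a few fields in common. I need to convert it to JSON objects.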

Here is the DataFrame schema:

  

query.printSchema()

root
 |-- test: string (nullable = true)

The DataFrame's values look like this:

  

query.show(10)

+--------------------+
|                test|
+--------------------+
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
+--------------------+
only showing top 10 rows

The solution I applied:

  1. Write the DataFrame to a text file:

        query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

  2. Read it back as JSON:

        df = spark.read.json("s3a://bucketname/temp/")

  3. Now print the schema; the JSON string in each row has been converted to a JSON object:

        df.printSchema()

        root
         |-- EventDate: string (nullable = true)
         |-- EventId: string (nullable = true)
         |-- EventNotificationType: long (nullable = true)
         |-- Interaction: struct (nullable = true)
         |    |-- ContextId: string (nullable = true)
         |    |-- Created: string (nullable = true)
         |    |-- Description: string (nullable = true)
         |    |-- Id: string (nullable = true)
         |    |-- ModelContextId: string (nullable = true)
         |-- PurchaseActivity: struct (nullable = true)
         |    |-- BillingCity: string (nullable = true)
         |    |-- BillingCountry: string (nullable = true)
         |    |-- ShippingAndHandlingAmount: double (nullable = true)
         |    |-- ShippingDiscountAmount: double (nullable = true)
         |    |-- SubscriberId: long (nullable = true)
         |    |-- SubscriptionOriginalEndDate: string (nullable = true)
         |-- SubscriptionChurn: struct (nullable = true)
         |    |-- PaymentTypeCode: long (nullable = true)
         |    |-- PaymentTypeName: string (nullable = true)
         |    |-- PreviousPaidAmount: double (nullable = true)
         |    |-- SubscriptionRemoved: string (nullable = true)
         |    |-- SubscriptionStartDate: string (nullable = true)
         |-- TransactionDetail: struct (nullable = true)
         |    |-- Amount: double (nullable = true)
         |    |-- OrderShipToCountry: string (nullable = true)
         |    |-- PayPalUserName: string (nullable = true)
         |    |-- PaymentSubTypeCode: long (nullable = true)
         |    |-- PaymentSubTypeName: string (nullable = true)
        

Is there a better way to get the expected output, one where I don't need to write the DataFrame out as a text file and read it back in as JSON?

1 Answer:

Answer 0: (score: 0)

You can use from_json() before writing to the text file, but you need to define the schema first.

The code is as follows:

from pyspark.sql.functions import from_json
data = query.select(from_json("test", schema=schema).alias("value")).selectExpr("value.*")

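# The text writer accepts only a single string column, so the parsed columns are written as JSON here
data.write.format("json").mode('overwrite').save("s3://bucketname/temp/")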