Creating a nested JSON file that handles null values

Asked: 2019-11-06 12:18:13

Tags: python json pyspark

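I am using PySpark, and I have the following code, which builds a nested JSON file from a DataFrame, with some fields (product, quantity, from, to) nested under "requirements". The code is shown below, followed by one row of the resulting JSON as an example.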

final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version').agg(F.collect_list(F.struct(F.col("product"), F.col("quantity"),  F.col("from"), F.col("to"))).alias('requirements'))


{"identifier":"xxx","plant":"xxxx","family":"xxxx","familyDescription":"xxxx","type":"assembled","name":"xxxx","description":"xxxx","batchSize":20.0,"phantom":"False","makeOrBuy":"make","safetyStock":0.0,"unit":"PZ","unitPrice":xxxx,"version":"0001","requirements":[{"product":"yyyy","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"zzzz","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"kkkk","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"wwww","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"bbbb","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"}]}

The schema of the final2 DataFrame is as follows:

 |-- identifier: string (nullable = true)
 |-- plant: string (nullable = true)
 |-- family: string (nullable = true)
 |-- familyDescription: string (nullable = true)
 |-- type: string (nullable = false)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- batchSize: double (nullable = true)
 |-- phantom: string (nullable = false)
 |-- makeOrBuy: string (nullable = false)
 |-- safetyStock: double (nullable = true)
 |-- unit: string (nullable = true)
 |-- unitPrice: double (nullable = true)
 |-- version: string (nullable = true)
 |-- requirements: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- product: string (nullable = true)
 |    |    |-- quantity: double (nullable = true)
 |    |    |-- from: timestamp (nullable = true)
 |    |    |-- to: timestamp (nullable = true)

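I have run into a problem because I need to add some rows to the final DataFrame in which product, quantity, from and to are all null. With the code above I get "requirements":[{}], but the database I write the file to (MongoDB) raises an error on the empty JSON object, because it expects "requirements":[] when the values are null.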

I tried:

 import pyspark.sql.functions as F
 df = final_prova2.withColumn("requirements", 
 F.when(final_prova2.requirements.isNull(), 
 F.array()).otherwise(final_prova2.requirements))

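But it does not work. Any suggestions on how to modify the code? I am struggling to find a solution (given the required structure, I do not even know whether one is possible).

Thanks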

1 answer:

Answer 0 (score: 1)

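You need to check whether all four fields of requirements are NULL, not the column itself. The requirements column is never NULL here: it is an array holding one struct whose fields are all NULL, which is why the isNull() test has no effect. One way to fix this is to adjust the collect_list aggregate function when creating final2: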

import pyspark.sql.functions as F

final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version') \
    .agg(F.expr("""
      collect_list(
        IF(coalesce(quantity, product, from, to) is NULL
          , NULL
          , struct(product, quantity, from, to)
        )
      )
    """).alias('requirements'))

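Where:

  • We use the SQL expression IF(condition, true_value, false_value) to build the argument passed to collect_list.

  • Condition: coalesce(quantity, product, from, to) is NULL tests whether all four listed columns are NULL. If it is true, the expression returns NULL; otherwise it returns struct(product, quantity, from, to). Because collect_list skips NULL values, a group whose rows are all NULL ends up with an empty array, which is written as "requirements":[].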
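
To make the reasoning above concrete, here is a minimal plain-Python sketch of the same logic. It is only a model, not Spark itself: the sample rows, the coalesce helper and the collect_requirements function are hypothetical names made up for illustration. It assumes, as Spark does, that collect_list ignores NULL inputs.

```python
import json

# Hypothetical sample rows. Identifier "B" has no requirements, so its
# four requirement fields are all None (NULL in Spark).
rows = [
    {"identifier": "A", "product": "yyyy", "quantity": 1.0,
     "from": "2000-01-01T00:00:00.000Z", "to": "9999-12-31T00:00:00.000Z"},
    {"identifier": "A", "product": "zzzz", "quantity": 0.0,
     "from": "2000-01-01T00:00:00.000Z", "to": "9999-12-31T00:00:00.000Z"},
    {"identifier": "B", "product": None, "quantity": None,
     "from": None, "to": None},
]

FIELDS = ("product", "quantity", "from", "to")


def coalesce(*values):
    """Return the first non-None value, or None if every value is None."""
    return next((v for v in values if v is not None), None)


def collect_requirements(rows):
    """Group rows by identifier and collect the non-NULL requirement structs."""
    grouped = {}
    for row in rows:
        items = grouped.setdefault(row["identifier"], [])
        values = [row[f] for f in FIELDS]
        # Mirrors IF(coalesce(...) IS NULL, NULL, struct(...)).
        struct = None if coalesce(*values) is None else dict(zip(FIELDS, values))
        # Mirrors collect_list, which skips NULL inputs.
        if struct is not None:
            items.append(struct)
    return grouped


result = collect_requirements(rows)
print(json.dumps(result["B"]))  # []
print(len(result["A"]))         # 2
```

Note that the check is "is None", not truthiness, so a row with quantity 0.0 is still kept. Only rows in which every one of the four fields is missing are dropped.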