I am trying to write out a PySpark DataFrame (DF) in JSON format. The DF has some rows with NaN values. I am writing the DF out like this:
DF.coalesce(1).write.format('json').mode('overwrite').save('myDest/' + ext)
The output JSON omits keys that have no value.
Here is a sample:
{"id":"890226","dt":"2018-01 14T17:05:00.000Z","key":2.9427571,"anotherkey":3}
{"id":"890226","dt":"2018-01-14T17:10:00.000Z","key":2.9815376,"anotherkey":3}
{"id":"890226","dt":"2018-01-14T17:15:00.000Z","key":2.94226,"anotherkey":3}
{"id":"890226","dt":"2018-01-14T17:20:00.000Z","anotherkey":1}
{"id":"890226","dt":"2018-01-14T17:25:00.000Z","anotherkey":1}
{"id":"890226","dt":"2018-01-14T17:30:00.000Z","anotherkey":1}
{"id":"890226","dt":"2018-01-14T17:35:00.000Z","anotherkey":1}
As the last four records show, the generated JSON skips the 'key' attribute, because its value in the DF is NaN.
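For context, here is a minimal sketch that reproduces the behaviour (the schema, values, and output path are illustrative; I represent the missing 'key' reading as None, i.e. a SQL null, since that is what Spark's JSON writer drops):

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType, LongType)

spark = SparkSession.builder.getOrCreate()

# Illustrative rows matching the sample above; the missing 'key'
# reading is None here, which Spark stores as a SQL null.
rows = [
    ("890226", "2018-01-14T17:15:00.000Z", 2.94226, 3),
    ("890226", "2018-01-14T17:20:00.000Z", None, 1),
]
schema = StructType([
    StructField("id", StringType()),
    StructField("dt", StringType()),
    StructField("key", DoubleType()),
    StructField("anotherkey", LongType()),
])
DF = spark.createDataFrame(rows, schema)

# The JSON writer omits keys whose value is null, so the second
# record comes out without "key" at all:
DF.coalesce(1).write.format('json').mode('overwrite').save('myDest/sample')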
In a Pandas DataFrame, there is an option to keep NaN, so that the key is still written out as key = None.
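For comparison, a minimal Pandas sketch of the behaviour I mean (values are illustrative):

import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "id": ["890226"],
    "dt": ["2018-01-14T17:20:00.000Z"],
    "key": [np.nan],       # missing value
    "anotherkey": [1],
})

# Pandas keeps the key and serialises NaN as JSON null:
print(pdf.to_json(orient="records", lines=True))
# {"id":"890226","dt":"2018-01-14T17:20:00.000Z","key":null,"anotherkey":1}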
Is there a way to preserve NaN when saving a PySpark DF?