Identifying and removing duplicate columns in a nested JSON DataFrame in PySpark

Time: 2020-06-14 19:06:34

Tags: python json pyspark pyspark-dataframes

I have a DataFrame with the following schema:

 |-- nlucontexttrail: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- agentid: string (nullable = true)
 |    |    |-- intent: struct (nullable = true)
 |    |    |    |-- confidence: double (nullable = true)
 |    |    |    |-- entities: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |-- values: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- literal: string (nullable = true)
 |    |    |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- intentname: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |-- intentcandidates: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- confidence: double (nullable = true)
 |    |    |    |    |-- entities: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |    |-- values: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |-- literal: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- intentname: string (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |-- modelid: string (nullable = true)
 |    |    |-- modelversion: long (nullable = true)
 |    |    |-- nlusessionid: string (nullable = true)
 |    |    |-- usednluengine: string (nullable = true)
 |    |    |-- usednluengine: string (nullable = true)

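As you can see, the column "usednluengine" appears twice (the last two lines of the schema). One of the two copies holds None, and the other holds the expected value. I want to drop the copy that holds None. I am also sharing the data below; please take a look.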

[{"agentid":"dispatcher","intent":{"confidence":0.8822699,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"station","values":[{"literal":"eins","value":"eins"},{"literal":"eins","value":"eins"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV"},"intentcandidates":[{"confidence":0.8822699,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"station","values":[{"literal":"eins","value":"eins"},{"literal":"eins","value":"eins"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV"}],"modelid":"SVH_STAGING__DISPATCHER","modelversion":13,"nlusessionid":null,"usednluengine":"luis"},{"agentid":"dispatcher","intent":{"confidence":0.140685484,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV__SWITCH_CHANNEL"},"intentcandidates":[{"confidence":0.140685484,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV__SWITCH_CHANNEL"}],"modelid":"SVH_STAGING__TV","modelversion":13,"nlusessionid":null,"usednluengine":"luis"}]

You can paste the data above into the following link to view it properly formatted: http://jsonviewer.stack.hu/

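One thing to note: the duplicate column whose value is None is not visible in the data, but it does show up in df.printSchema(). I want to remove every such duplicate column, including nested ones (fields that are part of an inner struct), and keep the copy that has values. In other words, the data does not change; only the schema does.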
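To make the goal concrete, here is a minimal sketch of how duplicate field names could be located and dropped at any nesting depth. It works on the plain-dict form of a schema (the structure that `df.schema.jsonValue()` returns), so it runs without a Spark session. The helper names `find_duplicate_fields` and `drop_duplicate_fields`, the `keep` parameter, and the trimmed example schema are all hypothetical illustrations, not part of PySpark. The schema alone cannot tell which copy holds None, so which occurrence to keep is an assumption that has to be checked against the data.

```python
from collections import Counter


def find_duplicate_fields(dtype, path=""):
    """Return dotted paths of field names that occur more than once in one struct.

    `dtype` is the dict form of a Spark type (as in df.schema.jsonValue());
    primitive types appear as plain strings such as "string".
    """
    if not isinstance(dtype, dict):
        return []  # primitive type, nothing to inspect
    kind = dtype.get("type")
    if kind == "array":
        return find_duplicate_fields(dtype["elementType"], path)
    if kind == "map":
        return find_duplicate_fields(dtype["valueType"], path)
    if kind != "struct":
        return []
    counts = Counter(f["name"] for f in dtype["fields"])
    dups = [f"{path}.{n}" if path else n for n, c in counts.items() if c > 1]
    for f in dtype["fields"]:
        child = f"{path}.{f['name']}" if path else f["name"]
        dups.extend(find_duplicate_fields(f["type"], child))
    return dups


def drop_duplicate_fields(dtype, keep="first"):
    """Return a copy of the schema dict with one occurrence kept per field name.

    keep="first" keeps the first occurrence, anything else keeps the last.
    This choice is an assumption; verify which copy actually carries values.
    """
    if not isinstance(dtype, dict):
        return dtype
    kind = dtype.get("type")
    if kind == "array":
        return {**dtype, "elementType": drop_duplicate_fields(dtype["elementType"], keep)}
    if kind == "map":
        return {**dtype, "valueType": drop_duplicate_fields(dtype["valueType"], keep)}
    if kind != "struct":
        return dtype
    fields = dtype["fields"] if keep == "first" else list(reversed(dtype["fields"]))
    seen, kept = set(), []
    for f in fields:
        if f["name"] not in seen:
            seen.add(f["name"])
            kept.append({**f, "type": drop_duplicate_fields(f["type"], keep)})
    if keep != "first":
        kept.reverse()  # restore the original field order
    return {**dtype, "fields": kept}


def _field(name, dtype):
    # Small helper for building the example schema below.
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


# Trimmed, hand-written version of the schema in the question.
schema = {
    "type": "struct",
    "fields": [
        _field("nlucontexttrail", {
            "type": "array",
            "containsNull": True,
            "elementType": {
                "type": "struct",
                "fields": [
                    _field("agentid", "string"),
                    _field("usednluengine", "string"),
                    _field("usednluengine", "string"),
                ],
            },
        }),
    ],
}

print(find_duplicate_fields(schema))  # ['nlucontexttrail.usednluengine']
cleaned = drop_duplicate_fields(schema)
print(find_duplicate_fields(cleaned))  # []
```

With PySpark available, the cleaned dict could presumably be turned back into a schema with `StructType.fromJson(cleaned)` and passed to `spark.read.schema(...)` when re-reading the JSON source. Whether the surviving field then receives the expected values depends on how the duplicate arose in the first place, so this step is untested here.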

I hope I have explained my problem clearly. If not, please comment below to discuss further.

0 answers:

No answers