How to remove elements from an array-of-structs column in Spark Scala

Time: 2019-08-02 15:55:29

Tags: scala dataframe apache-spark

I have the following DataFrame:

+------------------------+--------------------+---+---+----------+----------------------------------------------------------------------------------------------+
|_id                     |h                   |inc|op |ts        |webhooks                                                                                      |
+------------------------+--------------------+---+---+----------+----------------------------------------------------------------------------------------------+
|5926115bffecf947d9fdf965|-3783513890158363801|148|u  |1564077339|[[5,,,], [1, 2019-07-25 17:55:39.813,, 2019-07-25 17:55:39.819], [0,,,], [2,,,], [3,,,]]      |
|5926115bffecf947d9fdf965|-6421919050082865687|151|u  |1564077339|[[5,,,], [1, 2019-07-25 17:55:39.822,, 2019-07-25 17:55:39.845], [0,,,], [2,,,], [3,,,]]      |
|5926115bffecf947d9fdf965|-1953717027542703837|155|u  |1564077339|[[5,,,], [1, 2019-07-25 17:55:39.873,, 2019-07-25 17:55:39.878], [0,,,], [2,,,], [3,,,]]      |
|5926115bffecf947d9fdf965|7260191374440479618 |159|u  |1564077339|[[5,,,], [1, 2019-07-25 17:55:39.945,, 2019-07-25 17:55:39.951], [0,,,], [2,,,], [3,,,]]      |
|57d17de901cc6a6c9e0000ab|-2430099739381353477|131|u  |1564077339|[[5,,,], [1,,,], [0, 2019-07-25 17:55:39.722, error, 2019-07-25 17:55:39.731], [2,,,], [3,,,]]|
|5b9bf21bffecf966c2878b11|4122669520839049341 |30 |u  |1564077341|[[5,,,], [1,,,], [0,, listening, 2019-07-25 17:55:41.453], [2,,,], [3,,,]]                    |
|5b9bf21bffecf966c2878b11|4122669520839049341 |30 |u  |1564077341|[[5,,,], [1,,,], [0,, listening, 2019-07-25 17:55:41.453], [2,,,], [3,,,]]                    |
|5b9bf21bffecf966c2878b11|-7191334145177061427|60 |u  |1564077341|[[5,,,], [1,,,], [0,,, 2019-07-25 17:55:41.768], [2,,,], [3,,,]]                              |
|5b9bf21bffecf966c2878b11|1897433358396319399 |58 |u  |1564077341|[[5,,,], [1,,,], [0,,, 2019-07-25 17:55:41.767], [2,,,], [3,,,]]                              |
|5b9bf21bffecf966c2878b11|1897433358396319399 |58 |u  |1564077341|[[5,,,], [1,,,], [0,,, 2019-07-25 17:55:41.767], [2,,,], [3,,,]]                              |
|58c6d048edbb6e09eb177639|8363076784039152000 |23 |u  |1564077342|[[5,,,], [1,,,], [0,,, 2019-07-25 17:55:42.216], [2,,,], [3,,,]]                              |
|5b9bf21bffecf966c2878b11|-7191334145177061427|60 |u  |1564077341|[[5,,,], [1,,,], [0,,, 2019-07-25 17:55:41.768], [2,,,], [3,,,]]                              |
|58c6d048edbb6e09eb177639|8363076784039152000 |23 |u  |1564077342|[[5,,,], [1,,,], [0,,, 2019-07-25 17:55:42.216], [2,,,], [3,,,]]                              |
|5ac6a0d3b795b013a5a73a43|-3790832816225805697|36 |u  |1564077346|[[5,,,], [1,,,], [0,,,], [2, 2019-07-25 17:55:46.384,, 2019-07-25 17:55:46.400], [3,,,]]      |
|5ac6a0d3b795b013a5a73a43|-1747137668935062717|34 |u  |1564077346|[[5,,,], [1,,,], [0,,,], [2, 2019-07-25 17:55:46.385,, 2019-07-25 17:55:46.398], [3,,,]]      |
|5ac6a0d3b795b013a5a73a43|-1747137668935062717|34 |u  |1564077346|[[5,,,], [1,,,], [0,,,], [2, 2019-07-25 17:55:46.385,, 2019-07-25 17:55:46.398], [3,,,]]      |
|5ac6a0d3b795b013a5a73a43|-3790832816225805697|36 |u  |1564077346|[[5,,,], [1,,,], [0,,,], [2, 2019-07-25 17:55:46.384,, 2019-07-25 17:55:46.400], [3,,,]]      |
|5ac6a0d3b795b013a5a73a43|6060575882395080442 |63 |u  |1564077346|[[5,,,], [1,,,], [0,,,], [2, 2019-07-25 17:55:46.506,, 2019-07-25 17:55:46.529], [3,,,]]      |
|5ac6a0d3b795b013a5a73a43|6060575882395080442 |63 |u  |1564077346|[[5,,,], [1,,,], [0,,,], [2, 2019-07-25 17:55:46.506,, 2019-07-25 17:55:46.529], [3,,,]]      |
|594e88f1ffecf918a14c143e|736029767610412482  |58 |u  |1564077346|[[5,,,], [1,,,], [0, 2019-07-25 17:55:46.503,, 2019-07-25 17:55:46.513], [2,,,], [3,,,]]      |
+------------------------+--------------------+---+---+----------+----------------------------------------------------------------------------------------------+

It has the following schema:

root
 |-- _id: string (nullable = true)
 |-- h: string (nullable = true)
 |-- inc: string (nullable = true)
 |-- op: string (nullable = true)
 |-- ts: string (nullable = true)
 |-- webhooks: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- index: string (nullable = false)
 |    |    |-- failed_at: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- updated_at: string (nullable = true)

In the webhooks column, some elements have only one item set (the index, with every other field empty):

[[5,,,], [1, 2019-07-25 17:55:39.813,, 2019-07-25 17:55:39.819], [0,,,], [2,,,], [3,,,]]

How can I remove the elements that contain only a number, so that each row ends up with something like this:

[[1, 2019-07-25 17:55:39.813,, 2019-07-25 17:55:39.819]]
[[1, 2019-07-25 17:55:39.822,, 2019-07-25 17:55:39.845]] 

Thanks.

1 Answer:

Answer 0 (score: 0)

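First, explode your webhooks column, for example:

import org.apache.spark.sql.functions.{col, explode}
val exploded = df.withColumn("webhooks", explode(col("webhooks")))

so that each array element ends up on its own row. Then filter the exploded DataFrame (not the original df), keeping only the rows where at least one field other than index is set:

exploded.where(col("webhooks").getItem("failed_at").isNotNull || col("webhooks").getItem("status").isNotNull || col("webhooks").getItem("updated_at").isNotNull)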
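
The explode-and-filter steps above leave one row per array element, while the question asks for one array per row. A minimal sketch of how the filtered rows might be regrouped follows. It assumes the empty fields are nulls rather than empty strings. The Webhook case class, the row_id helper column, and the one-row sample DataFrame are hypothetical stand-ins for the data in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, explode, first, monotonically_increasing_id}

object RegroupWebhooks {
  // Hypothetical case class mirroring the webhooks element schema from the question.
  case class Webhook(index: String, failed_at: String, status: String, updated_at: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("regroup-webhooks").getOrCreate()
    import spark.implicits._

    // Tiny stand-in for the DataFrame in the question (only two of its columns).
    val df = Seq(
      ("5926115bffecf947d9fdf965", Seq(
        Webhook("5", null, null, null),
        Webhook("1", "2019-07-25 17:55:39.813", null, "2019-07-25 17:55:39.819"),
        Webhook("0", null, null, null)))
    ).toDF("_id", "webhooks")

    // Tag each row so that duplicate rows are not merged when grouping.
    val withId = df.withColumn("row_id", monotonically_increasing_id())

    // One row per array element, then keep elements with any field besides index set.
    val exploded = withId.withColumn("webhook", explode(col("webhooks"))).drop("webhooks")
    val kept = exploded.where(
      col("webhook.failed_at").isNotNull ||
      col("webhook.status").isNotNull ||
      col("webhook.updated_at").isNotNull)

    // Rebuild one array per original row. Rows whose elements were all
    // filtered out do not appear in the result, and collect_list does not
    // guarantee the original element order.
    val result = kept
      .groupBy("row_id")
      .agg(first("_id").as("_id"), collect_list("webhook").as("webhooks"))
      .drop("row_id")

    result.show(truncate = false)
    spark.stop()
  }
}
```

The remaining columns (h, inc, op, ts) could be carried through the aggregation the same way as _id.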

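I can't show the output because I couldn't test this against your DataFrame, but you can use my code as a reference to get the result you want.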
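
If Spark 2.4 or later is available, the array could also be filtered in place without exploding, using the SQL higher-order function filter through expr. This is only a sketch under that version assumption. It also assumes the empty fields are nulls rather than empty strings, and the Webhook case class and sample rows are hypothetical stand-ins for the question's data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object FilterWebhooksInPlace {
  // Hypothetical case class mirroring the webhooks element schema from the question.
  case class Webhook(index: String, failed_at: String, status: String, updated_at: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("filter-webhooks").getOrCreate()
    import spark.implicits._

    // Two sample rows shaped like the data in the question.
    val df = Seq(
      ("5926115bffecf947d9fdf965", Seq(
        Webhook("5", null, null, null),
        Webhook("1", "2019-07-25 17:55:39.813", null, "2019-07-25 17:55:39.819"))),
      ("5b9bf21bffecf966c2878b11", Seq(
        Webhook("1", null, null, null),
        Webhook("0", null, "listening", "2019-07-25 17:55:41.453")))
    ).toDF("_id", "webhooks")

    // Keep only the elements where at least one field besides index is set.
    // The lambda runs once per array element; the row count is unchanged.
    val cleaned = df.withColumn(
      "webhooks",
      expr("filter(webhooks, w -> w.failed_at IS NOT NULL OR w.status IS NOT NULL OR w.updated_at IS NOT NULL)"))

    cleaned.show(truncate = false)
    spark.stop()
  }
}
```

Unlike the explode approach, this keeps every original row, leaving an empty array where no element qualifies.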