How do I select all structs from a DataFrame in pyspark?

Time: 2019-06-07 11:04:18

Tags: pyspark

I have a JSON dataset loaded with pyspark.

I am trying to access the "x" component of every struct inside it.

Here is the output of df.select("level_instance_json.player").printSchema():

root
 |-- player: struct (nullable = true)
 |    |-- 0: struct (nullable = true)
 |    |    |-- head_pitch: long (nullable = true)
 |    |    |-- head_roll: long (nullable = true)
 |    |    |-- head_yaw: long (nullable = true)
 |    |    |-- r: long (nullable = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
 |    |-- 1: struct (nullable = true)
 |    |    |-- head_pitch: long (nullable = true)
 |    |    |-- head_roll: long (nullable = true)
 |    |    |-- head_yaw: long (nullable = true)
 |    |    |-- r: long (nullable = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
...

I tried using the "*" selector to select all of them, but it does not work. df.select("level_instance_json.player.*.x").show(10) fails with this error:

'No such struct field * in 0, 1, 10, 100, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 102,...

1 answer:

Answer 0: (score: 0)

You can do it like this:

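# Read the player keys ("0", "1", ...) from the schema of the player struct
list_player_numbers = [el.name for el in df.select("level_instance_json.player").schema['player'].dataType]
# Build the full path to the x field of each player
list_fields = ['.'.join(['level_instance_json', 'player', player_number, 'x']) for player_number in list_player_numbers]

# Select every x field; note that each resulting column is named "x"
output = df.select(list_fields)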

That should work.
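
A possible refinement, offered as a sketch rather than a tested recipe: because every selected column ends up named "x", it may help to give each one its own alias, and to order the keys numerically instead of lexicographically (the error message lists them as 0, 1, 10, 100, ...). The helper name build_x_columns and the x_<key> alias scheme below are made up for illustration, and the code assumes every player key is a numeric string. The path-building part is plain Python, so it can be checked without a Spark session:

```python
def build_x_columns(player_keys, prefix="level_instance_json.player", field="x"):
    """Return (path, alias) pairs, one per player key, ordered numerically.

    Hypothetical helper: assumes every key is a numeric string such as "0" or "17".
    """
    pairs = []
    for key in sorted(player_keys, key=int):
        # Backticks make Spark treat the numeric key as a single name part.
        path = "{}.`{}`.{}".format(prefix, key, field)
        pairs.append((path, "{}_{}".format(field, key)))
    return pairs


# In practice the keys would come from the schema, for example:
# keys = [f.name for f in df.select("level_instance_json.player").schema["player"].dataType]
keys = ["0", "1", "10", "100", "2"]  # sample keys for illustration only
pairs = build_x_columns(keys)
print(pairs[:2])

# With pyspark available, the selection could then be written as:
# from pyspark.sql import functions as F
# output = df.select([F.col(path).alias(alias) for path, alias in pairs])
```

With the aliases in place, the output columns would be named x_0, x_1, x_2, and so on, which makes them easier to tell apart in later steps.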

Xavier