我的数据框如下所示。我需要从输入数组类型列中提取值。你能告诉我怎样才能在pyspark中实现这个目标。
None
root
|-- input: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- A: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- B: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- C: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- D: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- E: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- timestamp: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
答案 0 :(得分:0)
希望这有帮助!
from itertools import chain
df.select('input').rdd.flatMap(lambda x: chain(*(x))).map(lambda x: x.values()).collect()