Question

我的数据框如下所示。我需要从输入数组类型列中提取值。你能告诉我怎样才能在pyspark中实现这个目标。

None
root
 |-- input: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: map (valueContainsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: double (valueContainsNull = true)
 |-- A: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: map (valueContainsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: double (valueContainsNull = true)
 |-- B: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: map (valueContainsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: double (valueContainsNull = true)
 |-- C: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: map (valueContainsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: double (valueContainsNull = true)
 |-- D: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: map (valueContainsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: double (valueContainsNull = true)
 |-- E: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: map (valueContainsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: double (valueContainsNull = true)
 |-- timestamp: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: map (valueContainsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: double (valueContainsNull = true)

Answer 1

希望这有帮助！

from itertools import chain
df.select('input').rdd.flatMap(lambda x: chain(*(x))).map(lambda x: x.values()).collect()

pyspark - 从dataframe获取数组类型的值

1 个答案: