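Update: I saw a similar question here. My case is slightly different because I have an array of structs, and the array/struct contains more columns. It has about 30 columns, of which I am only interested in two. Is there a way to work with just those two columns without having to rebuild or destructure all 30 fields?
I have a Spark DataFrame with the following schema: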
root
|-- FIELD1: struct (nullable = true)
| |-- FIELD_2: string (nullable = true)
| |-- FIELD_3: long (nullable = true)
| |-- FIELD_4: integer (nullable = true)
|-- FIELD5: struct (nullable = true)
| |-- FIELD6: integer (nullable = true)
| |-- FIELD7: string (nullable = true)
|-- FIELD8: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FIELD9: integer (nullable = true)
| | |-- FIELD10: string (nullable = true)
|-- FIELD11: struct (nullable = true)
| |-- FIELD12: integer (nullable = true)
|-- SOME_FIELD: integer (nullable = true)
|-- OTHER_FIELD: integer (nullable = true)
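What I want to do is process each row, run some computations on FIELD9, FIELD10 and FIELD7, and store the result in a new field inside each struct of the FIELD8 array column.
So the resulting DataFrame should have the schema shown below. Note the new field under the array of structs FIELD8.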
root
|-- FIELD1: struct (nullable = true)
| |-- FIELD_2: string (nullable = true)
| |-- FIELD_3: long (nullable = true)
| |-- FIELD_4: integer (nullable = true)
|-- FIELD5: struct (nullable = true)
| |-- FIELD6: integer (nullable = true)
| |-- FIELD7: string (nullable = true)
|-- FIELD8: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FIELD9: integer (nullable = true)
| | |-- FIELD10: string (nullable = true)
| | |-- NEW_FIELD_HERE: string (nullable = true)
|-- FIELD11: struct (nullable = true)
| |-- FIELD12: integer (nullable = true)
|-- SOME_FIELD: integer (nullable = true)
|-- OTHER_FIELD: integer (nullable = true)
I think I could create a case class that mirrors the struct and use a map operation, but is there a more efficient way?
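For reference, here is a minimal sketch of one approach that might avoid the case class entirely. It assumes Spark 3.1 or later, where the `transform` higher-order function and `Column.withField` are available. The object name `AddFieldToStructArray`, the helper `addNewField`, and the `concat_ws` expression are hypothetical placeholders, since the actual calculation is not specified above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, transform}

object AddFieldToStructArray {

  // Hypothetical helper: rewrites FIELD8 so that every struct element
  // gains a NEW_FIELD_HERE string field. Assumes Spark 3.1+.
  def addNewField(df: DataFrame): DataFrame =
    df.withColumn(
      "FIELD8",
      // transform applies the lambda to each element of the array column.
      transform(col("FIELD8"), elem =>
        // withField adds (or replaces) one field and keeps every other
        // field of the struct as it is, so the remaining fields never
        // have to be listed by name.
        elem.withField(
          "NEW_FIELD_HERE",
          // Placeholder calculation (an assumption): joins FIELD9,
          // FIELD10 and the row-level FIELD5.FIELD7 with a dash.
          concat_ws(
            "-",
            elem.getField("FIELD9").cast("string"),
            elem.getField("FIELD10"),
            col("FIELD5.FIELD7")
          )
        )
      )
    )
}
```

Calling `AddFieldToStructArray.addNewField(df).printSchema()` on a DataFrame with the schema above should show `NEW_FIELD_HERE` under the `FIELD8` elements. Because everything stays in Column expressions, the rows are not deserialized into JVM objects the way they would be with a case class and `map`.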
Thank you for your help.