我有超过1000个字段的结构类型,每个字段类型都是一个字符串。
root
|-- mac: string (nullable = true)
|-- kv: struct (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_CODE: string (nullable = true)
| |-- FTP_SERVER_HELLO_B64: string (nullable = true)
| |-- FTP_STATUS_HELLO_CODE: string (nullable = true)
| |-- HTML_LOGIN_FORM_ACTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_DETECTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_PASSWORD_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_TEXT_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_METHOD_0: string (nullable = true)
| |-- HTML_REDIRECT_TYPE_0: string (nullable = true)
我想只选择非null的字段,以及哪些字段为非null的标识符。反正有没有明确引用每个元素将此结构转换为数组?
答案 0 :(得分:1)
我使用udf
:
from pyspark.sql.types import *
from pyspark.sql.functions import udf
as_array = udf(
lambda arr: [x for x in arr if x is not None],
ArrayType(StringType()))
df.withColumn("arr", as_array(df["kv"])))