Convert all fields in a StructType to an array

Time: 2017-09-29 11:42:41

Tags: python apache-spark pyspark

I have a struct type with more than 1000 fields, and every field is a string.

root
 |-- mac: string (nullable = true)
 |-- kv: struct (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_FEAT_B64: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_FEAT_CODE: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_HELP_B64: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_HELP_CODE: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_SYST_B64: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_SYST_CODE: string (nullable = true)
 |    |-- FTP_SERVER_HELLO_B64: string (nullable = true)
 |    |-- FTP_STATUS_HELLO_CODE: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_ACTION_0: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_DETECTION_0: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_INPUT_PASSWORD_NAME_0: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_INPUT_TEXT_NAME_0: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_METHOD_0: string (nullable = true)
 |    |-- HTML_REDIRECT_TYPE_0: string (nullable = true)

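I want to select only the non-null fields, together with an identifier of which fields are non-null. Is there a way to convert this struct to an array without explicitly referencing each element?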

1 Answer:

Answer 0: (score: 1)

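I would use a udf. A struct column reaches a Python udf as a Row, which iterates over its field values, so no field has to be named explicitly: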

from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

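# kv is nullable, so pass a null struct through instead of iterating over None.
as_array = udf(
    lambda arr: None if arr is None else [x for x in arr if x is not None],
    ArrayType(StringType()))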


# withColumn returns a new DataFrame; df itself is left unchanged.
df.withColumn("arr", as_array(df["kv"]))
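
The array above keeps the non-null values but drops the field names, while the question also asks which fields are non-null. One possible way to keep both is to return a map from field name to value instead of an array. The following is only a sketch under stated assumptions: `non_null_fields`, `add_non_null_map` and the output column name `kv_non_null` are hypothetical names introduced here for illustration, and the code assumes the struct arrives in the udf as a `Row`, whose `asDict()` method returns its fields by name.

```python
def non_null_fields(row):
    """Return {field_name: value} for the non-null fields of a struct Row.

    Hypothetical helper: `row` is expected to expose asDict(), as
    pyspark.sql.Row does. A null struct is passed through as None.
    """
    if row is None:
        return None
    return {name: value for name, value in row.asDict().items()
            if value is not None}


def add_non_null_map(df, struct_col="kv", out_col="kv_non_null"):
    """Add a map<string,string> column holding only the non-null struct fields.

    Hypothetical wrapper; the column names are assumptions taken from the
    schema in the question.
    """
    # Imported here so non_null_fields can be tried without a Spark install.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import MapType, StringType

    to_map = udf(non_null_fields, MapType(StringType(), StringType()))
    return df.withColumn(out_col, to_map(df[struct_col]))
```

Calling `add_non_null_map(df)` should yield a new DataFrame with an extra map column; the keys of that map serve as the identifiers of the non-null fields. Like the answer's udf, this serializes each row to Python, so with more than 1000 fields it may be slow on large datasets.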