Question

Using PySpark, I have a DataFrame with a schema like the following:

root
 |-- id: string
 |-- v1: string
 |-- v2: string
 |-- v3: string

I now want to select the data and transform it into something like this:

root
 |-- ident: string
 |-- custom: struct
 |    |-- val1: string
 |    |-- val2: string
 |    |-- val3: string

I thought this would work:

df = (df.withColumn('ident', df['id'])
        .withColumn('custom.val1', df['v1'])
        .withColumn('custom.val2', df['v2'])
        .withColumn('custom.val3', df['v3'])
        .select(['ident', 'custom']))

However, as you can probably tell, it does not. Any help would be appreciated.

Answer 1

You can use struct to create a struct column:

import pyspark.sql.functions as f

df.select('id', f.struct(df.v1, df.v2, df.v3).alias('custom')).printSchema()

Or use selectExpr:

df.selectExpr('id', 'struct(v1, v2, v3) as custom').printSchema()

root
 |-- id: string (nullable = true)
 |-- custom: struct (nullable = false)
 |    |-- v1: string (nullable = true)
 |    |-- v2: string (nullable = true)
 |    |-- v3: string (nullable = true)

The resulting data:

import pyspark.sql.functions as f

df.select('id', f.struct(df.v1, df.v2, df.v3).alias('custom')).show()

+---+---------+
| id|   custom|
+---+---------+
|  a|[b, c, d]|
+---+---------+
