Question

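I have a DataFrame 'df' with the following schema:

root
 |-- batch_key: string (nullable = true)
 |-- company_id: integer (nullable = true)
 |-- users_info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first_name: string (nullable = true)
 |    |    |-- last_name: long (nullable = true)
 |    |    |-- total_amount: double (nullable = true)

The column users_info is an array containing multiple structs.

I want to change the column names so that 'batch_key' becomes 'batchKey', 'users_info' becomes 'usersInfo', 'first_name' becomes 'firstName', and so on.

I started with this code:

import scala.util.matching.Regex

var df2 = df
val regex = new Regex("_(.)")
for (col <- df.columns) {
  // Replace each "_x" with "X" in the top-level column name
  df2 = df2.withColumnRenamed(col, regex.replaceAllIn(col, { M => M.group(1).toUpperCase }))
}

But this code only changes the names of the batch_key, company_id and users_info columns, because for (col <- df.columns) iterates over df.columns, which returns only the top-level column names.

The nested columns under users_info are not changed. How can I modify the code above so that I can access the nested columns and change their names as well?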

Answer 1

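In words: create a Seq made up of the flattened schema. Then use org.apache.spark.sql.functions.col to create a Seq of col({old column name}).as({new column name}), where you can use your regex to produce the new column name. Finally, select all the columns from df under their new names by calling df.select({the seq of cols}).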
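The renaming rule itself can be tried in isolation before touching any DataFrame. The following is a minimal sketch in plain Scala; the object name `CamelCase` and the function name `toCamelCase` are made up for illustration, and the regex is the same `_(.)` pattern used in the question.

```scala
import scala.util.matching.Regex

object CamelCase {
  // Matches an underscore followed by any single character and captures that character.
  private val underscoreChar: Regex = "_(.)".r

  // Replaces each "_x" with "X". quoteReplacement guards against the captured
  // character being "$" or "\", which replaceAllIn would otherwise treat specially.
  def toCamelCase(name: String): String =
    underscoreChar.replaceAllIn(name, m => Regex.quoteReplacement(m.group(1).toUpperCase))
}
```

For example, `CamelCase.toCamelCase("first_name")` yields `firstName`, and a name without underscores is returned unchanged.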

More details:

Using the solution given in "Automatically and Elegantly flatten DataFrame in Spark SQL", you can first flatten the schema and put it into a Seq:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

def fullFlattenSchema(schema: StructType): Seq[String] = {
  def helper(schema: StructType, prefix: String): Seq[String] = {
    // Prefix nested field names with the path of their parent struct
    val fullName: String => String = name => if (prefix.isEmpty) name else s"$prefix.$name"
    schema.fields.flatMap {
      case StructField(name, inner: StructType, _, _) => helper(inner, fullName(name))
      case StructField(name, _, _, _) => Seq(fullName(name))
    }
  }
  helper(schema, "")
}

Then apply the regex to the Seq obtained this way:

val renamed_columns = fullFlattenSchema(df.schema).map(c => col(c).as(regex.replaceAllIn(c, { M => M.group(1).toUpperCase })))

Finally, select from the DataFrame using the renamed columns (df.select with the Seq of columns) and check that the schema is what you expect.
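One caveat worth noting: fullFlattenSchema only descends into fields whose type is StructType, so for the schema in the question, where users_info is an array of structs, the nested names are not reached. A possible alternative, sketched below under stated assumptions, is to rebuild each column's data type with renamed fields and cast to it. This relies on Spark's cast between struct types matching fields by position rather than by name, which is my understanding of its behavior and should be verified on your Spark version. The names `NestedRename`, `toCamel`, `renameType` and `renameAll` are hypothetical, and map types and top-level names containing dots are deliberately left unhandled.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}
import scala.util.matching.Regex

object NestedRename {
  private val underscoreChar: Regex = "_(.)".r

  // snake_case -> camelCase, same rule as the regex in the question.
  def toCamel(name: String): String =
    underscoreChar.replaceAllIn(name, m => Regex.quoteReplacement(m.group(1).toUpperCase))

  // Returns a copy of the data type in which every struct field is renamed,
  // descending into nested structs and into array element types.
  def renameType(dt: DataType): DataType = dt match {
    case s: StructType =>
      StructType(s.fields.map(f =>
        f.copy(name = toCamel(f.name), dataType = renameType(f.dataType))))
    case a: ArrayType =>
      a.copy(elementType = renameType(a.elementType))
    case other => other
  }

  // Assumption: casting to a type that differs only in field names renames the
  // nested fields; the alias renames the top-level column itself.
  def renameAll(df: DataFrame): DataFrame = {
    val cols = df.schema.fields.map { f =>
      col(f.name).cast(renameType(f.dataType)).as(toCamel(f.name))
    }
    df.select(cols: _*)
  }
}
```

Calling `NestedRename.renameAll(df).printSchema()` would then show whether the nested fields were renamed in place. Unlike the flattening approach, this keeps users_info as an array of structs instead of producing separate top-level columns.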

Scala - How to access nested columns in a DataFrame?
