PySpark parquet with struct columns

Date: 2019-11-29 18:44:43

Tags: python apache-spark pyspark

I want to add a nested object (a "struct") to a PySpark DataFrame and write it out to Parquet. I want to recreate the following (currently produced using Scala Spark + a udf, per How to add a new Struct column to a DataFrame):

 |-- _level1: struct (nullable = true)
 |    |-- level2a: struct (nullable = true)
 |    |    |-- fielda: string (nullable = true)
 |    |    |-- fieldb: string (nullable = true)
 |    |    |-- fieldc: string (nullable = true)
 |    |    |-- fieldd: string (nullable = true)
 |    |    |-- fielde: string (nullable = true)
 |    |    |-- fieldf: string (nullable = true)
 |    |-- level2b: struct (nullable = true)
 |    |    |-- fielda: string (nullable = true)
 |    |    |-- fieldb: string (nullable = true)
 |    |    |-- fieldc: string (nullable = true)

What is the best way to do this?

2 answers:

Answer 0 (score: 1)

If you want to nest columns, you can use the struct function. This is more efficient than a user-defined function (udf), because it operates directly on the Java virtual machine rather than round-tripping through Python.

Here is an example:

In [1]: from pyspark.sql.functions import struct, col
   ...: 
   ...: df = spark.createDataFrame([(list("abcdefABC"))],
   ...:                            schema=list("abcdefghi")
   ...:                            )
   ...: df2 = df.select(
   ...:     struct(
   ...:         struct(*(col(_).alias("field%s" % _) for _ in "abcdef")).alias("level2a"),
   ...:         struct(*(col(_).alias("field%s" % (chr(ord(_) - 6))) for _ in ("ghi"))).alias("level2b")
   ...:     ).alias("_level1")
   ...: )
   ...: 
   ...: df2.printSchema()
   ...: 
   ...: 
root
 |-- _level1: struct (nullable = false)
 |    |-- level2a: struct (nullable = false)
 |    |    |-- fielda: string (nullable = true)
 |    |    |-- fieldb: string (nullable = true)
 |    |    |-- fieldc: string (nullable = true)
 |    |    |-- fieldd: string (nullable = true)
 |    |    |-- fielde: string (nullable = true)
 |    |    |-- fieldf: string (nullable = true)
 |    |-- level2b: struct (nullable = false)
 |    |    |-- fielda: string (nullable = true)
 |    |    |-- fieldb: string (nullable = true)
 |    |    |-- fieldc: string (nullable = true)

There is a bit of character arithmetic here (chr converts a code point back to its character, ord gets a character's code point) to avoid the repetitive form (struct(col("a").alias("fielda"), col("b").alias("fieldb"), …)), but the main message is: use struct to build a new structured column from other columns.

Answer 1 (score: 0)

I figured out how to do what I wanted. The idea is to create a schema for the nested column (the struct), like so:

from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
            StructField('level2a',
                        StructType(
                            [
                                StructField('fielda', StringType(), nullable=False),
                                StructField('fieldb', StringType(), nullable=False),
                                StructField('fieldc', StringType(), nullable=False),
                                StructField('fieldd', StringType(), nullable=False),
                                StructField('fielde', StringType(), nullable=False),
                                StructField('fieldf', StringType(), nullable=False)
                            ])
                        ),
            StructField('level2b',
                        StructType(
                            [
                                StructField('fielda', StringType(), nullable=False),
                                StructField('fieldb', StringType(), nullable=False),
                                StructField('fieldc', StringType(), nullable=False)
                            ])
                        )
        ])

This can then be combined with a udf (taking the schema above as its return type) to get the desired result.


def make_meta(fielda, fieldb, fieldc, fieldd, fielde, fieldf, fieldalvl2, fieldblvl2, fieldclvl2):
    return [
        [fielda, fieldb, fieldc, fieldd, fielde, fieldf],
        [fieldalvl2, fieldblvl2, fieldclvl2]
    ]

# Passing make_meta directly avoids the error-prone pass-through lambda
# (the original had fielde and fieldf swapped in its parameter list).
test_udf = udf(make_meta, schema)

df = spark.range(0, 5)
df.withColumn("test", test_udf(lit("a"), lit("b"), lit("c"),lit("d"),lit("e"),lit("f"),lit("a"),lit("b"),lit("c"))).printSchema()

This prints the following:

root
 |-- id: long (nullable = false)
 |-- test: struct (nullable = true)
 |    |-- level2a: struct (nullable = true)
 |    |    |-- fielda: string (nullable = false)
 |    |    |-- fieldb: string (nullable = false)
 |    |    |-- fieldc: string (nullable = false)
 |    |    |-- fieldd: string (nullable = false)
 |    |    |-- fielde: string (nullable = false)
 |    |    |-- fieldf: string (nullable = false)
 |    |-- level2b: struct (nullable = true)
 |    |    |-- fielda: string (nullable = false)
 |    |    |-- fieldb: string (nullable = false)
 |    |    |-- fieldc: string (nullable = false)

In Scala you can return an instance of a case class from a udf; this is what I was trying to do in Python (i.e. return an object).