PySpark parquet with struct columns

Date: 2019-11-29 18:44:43

Tags: python apache-spark pyspark

I want to add a nested object (a "struct") to a PySpark DataFrame and write it out to Parquet. I want to recreate the following (currently produced using Scala Spark + a udf, per How to add a new Struct column to a DataFrame):

 |-- _level1: struct (nullable = true)
 |    |-- level2a: struct (nullable = true)
 |    |    |-- fielda: string (nullable = true)
 |    |    |-- fieldb: string (nullable = true)
 |    |    |-- fieldc: string (nullable = true)
 |    |    |-- fieldd: string (nullable = true)
 |    |    |-- fielde: string (nullable = true)
 |    |    |-- fieldf: string (nullable = true)
 |    |-- level2b: struct (nullable = true)
 |    |    |-- fielda: string (nullable = true)
 |    |    |-- fieldb: string (nullable = true)
 |    |    |-- fieldc: string (nullable = true)

What is the best way to do this?

2 answers:

Answer 0 (score: 1)

If you want to nest columns, you can use the struct function. This is more efficient than a user-defined function (udf), because it operates directly on the Java virtual machine rather than round-tripping through Python.

Here is an example:

In [1]: from pyspark.sql.functions import struct, col
   ...: 
   ...: df = spark.createDataFrame([(list("abcdefABC"))],
   ...:                            schema=list("abcdefghi")
   ...:                            )
   ...: df2 = df.select(
   ...:     struct(
   ...:         struct(*(col(_).alias("field%s" % _) for _ in "abcdef")).alias("level2a"),
   ...:         struct(*(col(_).alias("field%s" % (chr(ord(_) - 6))) for _ in ("ghi"))).alias("level2b")
   ...:     ).alias("_level1")
   ...: )
   ...: 
   ...: df2.printSchema()
   ...: 
   ...: 
root
 |-- _level1: struct (nullable = false)
 |    |-- level2a: struct (nullable = false)
 |    |    |-- fielda: string (nullable = true)
 |    |    |-- fieldb: string (nullable = true)
 |    |    |-- fieldc: string (nullable = true)
 |    |    |-- fieldd: string (nullable = true)
 |    |    |-- fielde: string (nullable = true)
 |    |    |-- fieldf: string (nullable = true)
 |    |-- level2b: struct (nullable = false)
 |    |    |-- fielda: string (nullable = true)
 |    |    |-- fieldb: string (nullable = true)
 |    |    |-- fieldc: string (nullable = true)

There is a bit of character arithmetic here (chr converts a code point back to its character, ord gets a character's code point) to avoid the repetitive form (struct(col("a").alias("fielda"), col("b").alias("fieldb"), …)), but the main message is: use struct to build a new structured column from other columns.

Answer 1 (score: 0)

I figured out how to do what I wanted. The idea is to create a schema for the nested column (the struct), like so:

from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
            StructField('level2a',
                        StructType(
                            [
                                StructField('fielda', StringType(), nullable=False),
                                StructField('fieldb', StringType(), nullable=False),
                                StructField('fieldc', StringType(), nullable=False),
                                StructField('fieldd', StringType(), nullable=False),
                                StructField('fielde', StringType(), nullable=False),
                                StructField('fieldf', StringType(), nullable=False)
                            ])
                        ),
            StructField('level2b',
                        StructType(
                            [
                                StructField('fielda', StringType(), nullable=False),
                                StructField('fieldb', StringType(), nullable=False),
                                StructField('fieldc', StringType(), nullable=False)
                            ])
                        )
        ])

This can then be combined with a udf (taking the schema above as its return type) to get the desired result.


def make_meta(fielda, fieldb, fieldc, fieldd, fielde, fieldf, fieldalvl2, fieldblvl2, fieldclvl2):
    return [
        [fielda, fieldb, fieldc, fieldd, fielde, fieldf],
        [fieldalvl2, fieldblvl2, fieldclvl2]
    ]

# Passing make_meta directly avoids the error-prone pass-through lambda
# (the original had fielde and fieldf swapped in its parameter list).
test_udf = udf(make_meta, schema)

df = spark.range(0, 5)
df.withColumn("test", test_udf(lit("a"), lit("b"), lit("c"),lit("d"),lit("e"),lit("f"),lit("a"),lit("b"),lit("c"))).printSchema()

This prints the following:

root
 |-- id: long (nullable = false)
 |-- test: struct (nullable = true)
 |    |-- level2a: struct (nullable = true)
 |    |    |-- fielda: string (nullable = false)
 |    |    |-- fieldb: string (nullable = false)
 |    |    |-- fieldc: string (nullable = false)
 |    |    |-- fieldd: string (nullable = false)
 |    |    |-- fielde: string (nullable = false)
 |    |    |-- fieldf: string (nullable = false)
 |    |-- level2b: struct (nullable = true)
 |    |    |-- fielda: string (nullable = false)
 |    |    |-- fieldb: string (nullable = false)
 |    |    |-- fieldc: string (nullable = false)

In Scala you can return an instance of a case class from a udf; this is what I was trying to do in Python (i.e. return an object).