I want to add a nested object (a "struct") to a PySpark DataFrame and write it out to Parquet. I want to recreate the following (currently prepared with Scala Spark + a udf, as in "How to add a new Struct column to a DataFrame"):
|-- _level1: struct (nullable = true)
| |-- level2a: struct (nullable = true)
| | |-- fielda: string (nullable = true)
| | |-- fieldb: string (nullable = true)
| | |-- fieldc: string (nullable = true)
| | |-- fieldd: string (nullable = true)
| | |-- fielde: string (nullable = true)
| | |-- fieldf: string (nullable = true)
| |-- level2b: struct (nullable = true)
| | |-- fielda: string (nullable = true)
| | |-- fieldb: string (nullable = true)
| | |-- fieldc: string (nullable = true)
What is the best way to do this?
Answer 0 (score: 1)
If you want to nest columns, you can use the struct function. Because it operates directly on the JVM, it is more efficient than a user-defined function (udf).
Here is an example:
In [1]: from pyspark.sql.functions import struct, col
   ...:
   ...: df = spark.createDataFrame([(list("abcdefABC"))],
   ...:                            schema=list("abcdefghi")
   ...: )
   ...: df2 = df.select(
   ...:     struct(
   ...:         struct(*(col(_).alias("field%s" % _) for _ in "abcdef")).alias("level2a"),
   ...:         struct(*(col(_).alias("field%s" % chr(ord(_) - 6)) for _ in "ghi")).alias("level2b")
   ...:     ).alias("_level1")
   ...: )
   ...:
   ...: df2.printSchema()
root
|-- _level1: struct (nullable = false)
| |-- level2a: struct (nullable = false)
| | |-- fielda: string (nullable = true)
| | |-- fieldb: string (nullable = true)
| | |-- fieldc: string (nullable = true)
| | |-- fieldd: string (nullable = true)
| | |-- fielde: string (nullable = true)
| | |-- fieldf: string (nullable = true)
| |-- level2b: struct (nullable = false)
| | |-- fielda: string (nullable = true)
| | |-- fieldb: string (nullable = true)
| | |-- fieldc: string (nullable = true)
There is a bit of string arithmetic here (chr converts a code point to its Unicode character, ord returns a character's code point) to avoid repetition (struct(col("a").alias("fielda"), col("b").alias("fieldb"), …)), but the main message is: use struct to build a new structured column from other columns.
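The renaming trick can be checked in plain Python, with no Spark needed: shifting each of g, h, i back by six code points yields a, b, c, so level2b reuses the fielda/fieldb/fieldc names:

```python
# ord() gives a character's code point; chr() converts a code point back
# to a character, so "g" (103) - 6 -> 97 -> "a", and so on.
names = ["field%s" % chr(ord(c) - 6) for c in "ghi"]
print(names)  # ['fielda', 'fieldb', 'fieldc']
```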
Answer 1 (score: 0)
I managed to do what I wanted. The idea is to create a schema for the nested column (struct), like so:
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
    StructField('level2a', StructType([
        StructField('fielda', StringType(), nullable=False),
        StructField('fieldb', StringType(), nullable=False),
        StructField('fieldc', StringType(), nullable=False),
        StructField('fieldd', StringType(), nullable=False),
        StructField('fielde', StringType(), nullable=False),
        StructField('fieldf', StringType(), nullable=False)
    ])),
    StructField('level2b', StructType([
        StructField('fielda', StringType(), nullable=False),
        StructField('fieldb', StringType(), nullable=False),
        StructField('fieldc', StringType(), nullable=False)
    ]))
])
It can then be combined with a udf (taking the schema above as an argument) to get the desired result:
def make_meta(fielda, fieldb, fieldc, fieldd, fielde, fieldf,
              fieldalvl2, fieldblvl2, fieldclvl2):
    return [
        [fielda, fieldb, fieldc, fieldd, fielde, fieldf],
        [fieldalvl2, fieldblvl2, fieldclvl2]
    ]
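As a quick sanity check (plain Python, no Spark session needed), the helper returns two nested lists whose shapes line up with the level2a and level2b field lists in the schema:

```python
# Restating the helper outside Spark to show the shape of its return value:
# one inner list of six values (level2a) and one of three values (level2b).
def make_meta(fielda, fieldb, fieldc, fieldd, fielde, fieldf,
              fieldalvl2, fieldblvl2, fieldclvl2):
    return [
        [fielda, fieldb, fieldc, fieldd, fielde, fieldf],
        [fieldalvl2, fieldblvl2, fieldclvl2]
    ]

result = make_meta("a", "b", "c", "d", "e", "f", "a", "b", "c")
print(result)  # [['a', 'b', 'c', 'd', 'e', 'f'], ['a', 'b', 'c']]
```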
test_udf = udf(lambda fielda, fieldb, fieldc, fieldd, fielde, fieldf,
                      fieldalvl2, fieldblvl2, fieldclvl2:
               make_meta(fielda, fieldb, fieldc, fieldd, fielde, fieldf,
                         fieldalvl2, fieldblvl2, fieldclvl2),
               schema)

df = spark.range(0, 5)
df.withColumn("test", test_udf(lit("a"), lit("b"), lit("c"), lit("d"), lit("e"), lit("f"),
                               lit("a"), lit("b"), lit("c"))).printSchema()
This prints the following:
root
|-- id: long (nullable = false)
|-- test: struct (nullable = true)
| |-- level2a: struct (nullable = true)
| | |-- fielda: string (nullable = false)
| | |-- fieldb: string (nullable = false)
| | |-- fieldc: string (nullable = false)
| | |-- fieldd: string (nullable = false)
| | |-- fielde: string (nullable = false)
| | |-- fieldf: string (nullable = false)
| |-- level2b: struct (nullable = true)
| | |-- fielda: string (nullable = false)
| | |-- fieldb: string (nullable = false)
| | |-- fieldc: string (nullable = false)
In Scala you can return an instance of a case class from a udf, which is what I was trying to do here in Python (i.e. return an object).
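As an aside, a Python udf with a StructType return type can generally also return a dict keyed by field name (or a pyspark.sql.Row), which reads closer to Scala's case-class style than positional lists. A minimal plain-Python sketch of such a helper follows; make_meta_dict is a hypothetical name, and the Spark wiring would be the same udf(make_meta_dict, schema) call as above:

```python
# Hypothetical helper: build the nested value as dicts keyed by the
# schema's field names instead of positional lists. When returned from a
# udf declared with the StructType schema above, keys are matched to
# StructField names.
def make_meta_dict(fielda, fieldb, fieldc, fieldd, fielde, fieldf,
                   fieldalvl2, fieldblvl2, fieldclvl2):
    return {
        "level2a": {"fielda": fielda, "fieldb": fieldb, "fieldc": fieldc,
                    "fieldd": fieldd, "fielde": fielde, "fieldf": fieldf},
        "level2b": {"fielda": fieldalvl2, "fieldb": fieldblvl2,
                    "fieldc": fieldclvl2},
    }

row = make_meta_dict("a", "b", "c", "d", "e", "f", "a", "b", "c")
print(sorted(row))  # ['level2a', 'level2b']
```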