How to create a UDF with two inputs in pyspark

Asked: 2017-07-11 08:21:13

Tags: python-2.7 apache-spark pyspark

I'm new to pyspark, and I'm trying to create a simple udf that takes two input columns, checks whether the second one is blank, and if so splits the first column into two values and overwrites the original columns. This is what I did:

def split(x, y):
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")

udf_split = udf(split, ArrayType())

df = df \
.withColumn("x", udf_split(df['x'], df['y'])[1]) \
.withColumn("y", udf_split(df['x'], df['y'])[0])

But when I run this code, I get the following error:

File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)

What am I doing wrong?

Thank you, Álvaro

2 Answers:

Answer 0 (score: 2)

I'm not sure exactly what you're trying to do, but here's how I would do it based on my understanding:

from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

def split(x, y):
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")

schema = StructType([StructField("x1", StringType(), False), StructField("y1", StringType(), False)])
udf_split = udf(split, schema)

df = spark.createDataFrame([("EXDRA", ""), ("EXIZQ", ""), ("", "foo")], ("x", "y"))

df.show()

# +-----+---+
# |    x|  y|
# +-----+---+
# |EXDRA|   |
# |EXIZQ|   |
# |     |foo|
# +-----+---+

df = df \
.withColumn("split", udf_split(df['x'], df['y'])) \
.withColumn("x", col("split.x1")) \
.withColumn("y", col("split.y1"))

df.printSchema()

# root
#  |-- x: string (nullable = true)
#  |-- y: string (nullable = true)
#  |-- split: struct (nullable = true)
#  |    |-- x1: string (nullable = false)
#  |    |-- y1: string (nullable = false)


df.show()

# +----+----+----------+
# |   x|   y|     split|
# +----+----+----------+
# | EXT|DCHA|[EXT,DCHA]|
# | EXT|IZDA|[EXT,IZDA]|
# |null|null|      null|
# +----+----+----------+
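Note that when neither condition matches, `split` falls through and implicitly returns `None`, which is why the last row (`("", "foo")`) comes out as all nulls in the output above. A minimal pure-Python check of that fall-through behavior:

```python
def split(x, y):
    # Same logic as the UDF above: only two specific (x, y) pairs are handled.
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")

# Unmatched inputs fall through the ifs and return None;
# Spark turns that None into a null struct for the row.
print(split("EXDRA", ""))  # ('EXT', 'DCHA')
print(split("", "foo"))    # None
```

If you want non-null defaults instead, add an explicit `return` at the end of the function.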

Answer 1 (score: 0)

I'm guessing the TypeError comes from calling `ArrayType()` with no arguments; its constructor requires an element type. You have to define the udf as:

udf_split = udf(split, ArrayType(StringType()))
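With `ArrayType(StringType())` as the return type, the tuple returned by `split` is serialized as an array of strings, so the positional indexing from the question (`udf_split(...)[0]`, `udf_split(...)[1]`) works directly, rather than the struct-field access used in the other answer. A pure-Python sketch of the values Spark would serialize (using the same `split` function as the question):

```python
def split(x, y):
    # Same function as in the question.
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")

# Declared as ArrayType(StringType()), the returned tuple becomes an
# array column, so elements are addressed by position: [0], [1].
value = split("EXIZQ", "")
print(value[0], value[1])  # EXT IZDA
```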