Question

I have written data preprocessing code as a Pandas UDF in PySpark. I am using a lambda function to extract part of the text from every record in a column.

This is what my code looks like:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("string", PandasUDFType.SCALAR)
def get_first_name(col):
    return col.apply(lambda x: x.split(',')[-1] if len(x.split(',')) > 0 else x)

df = df.withColumn('X', get_first_name(df.Y))

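This works fine and gives the desired result. However, I need to write the same logic using equivalent native Spark code. Is there a way to do that? Thanks.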

Answer 1

I think a single function, substring_index, is enough for this particular task:

from pyspark.sql.functions import substring_index

df = spark.createDataFrame([(x,) for x in ['f,l', 'g', 'a,b,cd']], ['c1'])

df.withColumn('c2', substring_index('c1', ',', -1)).show()
+------+---+
|    c1| c2|
+------+---+
|   f,l|  l|
|     g|  g|
|a,b,cd| cd|
+------+---+
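
To make the role of the -1 argument concrete, here is a plain-Python sketch of what substring_index appears to compute for a non-null string and a non-empty delimiter. The helper name substring_index_model is hypothetical and this is only a model for illustration, not Spark's implementation:

```python
def substring_index_model(s, delim, count):
    """Illustrative model only: mimics substring_index for non-null input."""
    parts = s.split(delim)
    if count >= 0:
        # Positive count: keep everything before the count-th delimiter from the left.
        return delim.join(parts[:count])
    # Negative count: keep everything after the |count|-th delimiter from the right.
    return delim.join(parts[count:])

for value in ["f,l", "g", "a,b,cd"]:
    print(value, "->", substring_index_model(value, ",", -1))
```

With a count of -1 the model keeps only the text after the last comma, and a string without any comma comes back unchanged, which matches the table above.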

Answer 2

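You can use when to implement the if-then-else logic:

First split the column, then compute the size of the resulting array. If the size is greater than 0, take the last element of the split array. Otherwise, return the original column.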

from pyspark.sql.functions import split, size, when

def get_first_name(col):
    col_split = split(col, ',')
    split_size = size(col_split)
    return when(split_size > 0, col_split[split_size-1]).otherwise(col)

As an example, suppose you have the following DataFrame:

df.show()
#+---------+
#| BENF_NME|
#+---------+
#|Doe, John|
#|  Madonna|
#+---------+

You can call the new function just as you did before:

df = df.withColumn('First_Name', get_first_name(df.BENF_NME))
df.show()
#+---------+----------+
#| BENF_NME|First_Name|
#+---------+----------+
#|Doe, John|      John|
#|  Madonna|   Madonna|
#+---------+----------+
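
As a side note, the size check mirrors the len(...) > 0 test in the question's lambda. In plain Python, at least, that test holds for every string, so the fallback branch there is never taken. The sample strings below are made up for illustration, and whether Spark's split behaves the same way for every input is an assumption I have not verified here:

```python
samples = ["Doe, John", "Madonna", "", ","]

for text in samples:
    parts = text.split(",")
    # str.split with an explicit separator always returns at least one element,
    # even for the empty string, so len(parts) > 0 is always true.
    print(repr(text), len(parts), repr(parts[-1]))
```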

Answer 3

Given the following DataFrame df:

df.show()
# +-------------+
# |     BENF_NME|
# +-------------+
# |    Doe, John|
# |          Foo|
# |Baz, Quux,Bar|
# +-------------+

You can simply use regexp_extract() to select the first name:

from pyspark.sql.functions import regexp_extract
df.withColumn('First_Name', regexp_extract(df.BENF_NME, r'(?:.*,\s*)?(.*)', 1)).show()
# +-------------+----------+
# |     BENF_NME|First_Name|
# +-------------+----------+
# |    Doe, John|      John|
# |          Foo|       Foo|
# |Baz, Quux,Bar|       Bar|
# +-------------+----------+
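
If you want to see how the pattern behaves without starting Spark, it can be tried with Python's re module. This is only a sketch: it assumes Python and Java regular expressions agree on this particular pattern (which I believe they do), and the helper name last_name_part is hypothetical:

```python
import re

# Optional prefix "anything up to the last comma plus whitespace", then capture the rest.
PATTERN = re.compile(r'(?:.*,\s*)?(.*)')

def last_name_part(text):
    # The greedy .* runs to the last comma, \s* swallows the space after it,
    # and group 1 captures whatever remains.
    return PATTERN.match(text).group(1)

for name in ["Doe, John", "Foo", "Baz, Quux,Bar"]:
    print(repr(name), "->", repr(last_name_part(name)))
```

Because \s* consumes the whitespace after the comma, the captured value has no leading space.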

If you don't care about a possible leading space, substring_index() offers a simple alternative that reproduces the original logic:

from pyspark.sql.functions import substring_index
df.withColumn('First_Name', substring_index(df.BENF_NME, ',', -1)).show()
# +-------------+----------+
# |     BENF_NME|First_Name|
# +-------------+----------+
# |    Doe, John|      John|
# |          Foo|       Foo|
# |Baz, Quux,Bar|       Bar|
# +-------------+----------+

In this case, First_Name in the first row has a leading space:

df.withColumn(...).collect()[0]
# Row(BENF_NME=u'Doe, John', First_Name=u' John')
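
The same effect can be seen in plain Python, along with one way to strip the space afterwards. This is just an illustration with a made-up variable name; on the Spark side, wrapping the expression in trim() from pyspark.sql.functions should be the analogous step:

```python
name = "Doe, John"

raw = name.split(",")[-1]   # keeps the space that followed the comma
clean = raw.strip(" ")      # removes leading and trailing spaces only

print(repr(raw), repr(clean))
```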

If you still want to use a custom function, you need to create a user-defined function (UDF) with udf():

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
get_first_name = udf(lambda s: s.split(',')[-1], StringType())
df.withColumn('First_Name', get_first_name(df.BENF_NME)).show()
# +-------------+----------+
# |     BENF_NME|First_Name|
# +-------------+----------+
# |    Doe, John|      John|
# |          Foo|       Foo|
# |Baz, Quux,Bar|       Bar|
# +-------------+----------+

Note that UDFs are slower than built-in Spark functions, especially Python UDFs.
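
One more practical caveat with the lambda above: if the column contains nulls, the Python function receives None, and None.split(...) raises an AttributeError. A minimal sketch of a null-tolerant function that could be passed to udf() in place of the lambda (the name last_field is hypothetical):

```python
def last_field(s):
    """Return the text after the last comma; pass None through unchanged."""
    if s is None:
        # Null column values arrive as None; return None so the result stays null.
        return None
    return s.split(",")[-1]

print(last_field("Baz, Quux,Bar"))
print(last_field(None))
```

It would be registered the same way as before, i.e. udf(last_field, StringType()).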

PySpark equivalent of a lambda function in a Pandas UDF
