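Replace a substring containing a dollar sign ($) with another column's value in PySpark

Date: 2019-03-18 11:15:02

Tags: regex apache-spark replace pyspark

I am trying to replace the substring '$NUMBER' with the value of the "number" column in each row. I tried: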

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

replace_udf = udf(
    lambda long_text, number: long_text.replace("$NUMBER", number),
    StringType()
)

df = df.withColumn('long_text',replace_udf(col('long_text'),col('number')))

and:

from pyspark.sql.functions import expr

df = df.withColumn('long_text',expr("regexp_replace(long_text, '$NUMBER', number)"))

But neither has any effect. I can't figure out how to replace a substring with the value of another column.

Example:

df1 = spark.createDataFrame(
    [
        ("hahaha the $NUMBER is good",3),
        ("i dont know about $NUMBER",2),
        ("what is $NUMBER doing?",5),
        ("ajajaj $NUMBER",2),
        ("$NUMBER dwarfs",1)
    ],
    ["long_text","number"]
) 

Input:

+--------------------------+------+
|                 long_text|number|
+--------------------------+------+
|hahaha the $NUMBER is good|     3|
|    what is $NUMBER doing?|     5|
|            ajajaj $NUMBER|     2|
+--------------------------+------+

Expected output:

+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|            ajajaj 2|     2|
+--------------------+------+

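A similar question, whose answers do not cover replacing with a column value: Spark column string replace when present in other column (row)

2 answers:

Answer 0 (score: 2)

The problem is that $ has a special meaning in regular expressions: it matches the end of the line. So your code: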

regexp_replace(long_text, '$NUMBER', number)

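is trying to match the pattern "end of line, followed by the literal string NUMBER", which can never match anything.

To match a literal $ (or any other regex special character), you must escape it with \: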

from pyspark.sql.functions import expr

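# With default settings the Spark SQL parser also unescapes string literals,
# so the regex \$ is written as \\$ here; the raw Python string (r"...")
# passes both backslashes through unchanged.
df = df.withColumn('long_text', expr(r"regexp_replace(long_text, '\\$NUMBER', number)"))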
df.show()
#+--------------------+------+
#|           long_text|number|
#+--------------------+------+
#|hahaha the 3 is good|     3|
#|    what is 5 doing?|     5|
#|            ajajaj 2|     2|
#+--------------------+------+
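
The effect of the unescaped $ can be checked without a Spark session. The sketch below uses Python's re module as a stand-in: Spark's regexp_replace uses Java regular expressions, but $ is an end anchor in both. The sample text comes from the question; the character-class form and re.escape are additions for illustration, not part of the original answer.

```python
import re

text = "hahaha the $NUMBER is good"

# Unescaped: "$" is an end-of-line anchor, so "$NUMBER" can never match
# and the text comes back unchanged.
unescaped = re.sub("$NUMBER", "3", text)

# Escaped: "\$" matches a literal dollar sign.
escaped = re.sub(r"\$NUMBER", "3", text)

# A character class also matches a literal "$", with no backslash to escape.
char_class = re.sub("[$]NUMBER", "3", text)

# re.escape builds the escaped pattern from the literal text.
pattern = re.escape("$NUMBER")
from_escape = re.sub(pattern, "3", text)

print(unescaped)
print(escaped)
print(char_class)
print(from_escape)
```

The character-class form '[$]NUMBER' may be convenient inside expr(), since it contains no backslashes for Python or the SQL parser to interpret.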

Answer 1 (score: 0)

You have to convert the number column to a string with str() before you can pass it to replace inside the lambda:

from pyspark.sql import types as T
from pyspark.sql import functions as F

l = [
    ('hahaha the $NUMBER is good', 3),
    ('what is $NUMBER doing?', 5),
    ('ajajaj $NUMBER  ', 2),
]
df = spark.createDataFrame(l,['long_text','number'])

#Just added str() to your function
replace_udf = F.udf(lambda long_text, number: long_text.replace("$NUMBER", str(number)), T.StringType())

df.withColumn('long_text',replace_udf(F.col('long_text'),F.col('number'))).show()

+--------------------+------+ 
|           long_text|number| 
+--------------------+------+ 
|hahaha the 3 is good|     3| 
|    what is 5 doing?|     5|
|           ajajaj 2 |     2| 
+--------------------+------+
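
Why the original UDF needs str() can be seen in plain Python: str.replace only accepts string arguments, so passing the integer from the number column raises a TypeError. Below is a minimal sketch of the function the UDF would wrap; the name replace_number and the None handling are assumptions for illustration, not part of the original answer.

```python
def replace_number(long_text, number):
    """Replace the literal "$NUMBER" in long_text with number as text."""
    # Null cells reach a Python UDF as None; pass them through unchanged.
    if long_text is None or number is None:
        return long_text
    # str.replace does plain substring replacement, so "$" needs no escaping,
    # but the replacement must be a str, hence str(number).
    return long_text.replace("$NUMBER", str(number))


# Without str(), an integer replacement is rejected by str.replace.
try:
    "ajajaj $NUMBER".replace("$NUMBER", 2)
    raised = False
except TypeError:
    raised = True

print(raised)
print(replace_number("what is $NUMBER doing?", 5))
```

A named function like this could be wrapped with F.udf(replace_number, T.StringType()) in place of the lambda, which also makes it easy to test outside Spark.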