我正在尝试将子字符串'$NUMBER'
替换为每一行“数字”列中的值。
我尝试过
from pyspark.sql.functions import udf
from pyspark.sql.Types import StringType
replace_udf = udf(
lambda long_text, number: long_text.replace("$NUMBER", number),
StringType()
)
df = df.withColumn('long_text',replace_udf(col('long_text'),col('number')))
和
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '$NUMBER', number)"))
但没有任何效果。我无法弄清楚如何用另一列替换子字符串。
示例:
df1 = spark.createDataFrame(
[
("hahaha the $NUMBER is good",3),
("i dont know about $NUMBER",2),
("what is $NUMBER doing?",5),\
("ajajaj $NUMBER",2),
("$NUMBER dwarfs",1)
],
["long_text","number"]
)
输入:
+---------------------------------+------+
| long_text . |number|
+---------------------------------+------+
|hahaha the $NUMBER is good | 3|
| what is $NUMBER doing? | 5|
| ajajaj $NUMBER | 2|
+---------------------------------+------+
预期输出:
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 123| 2|
+--------------------+------+
类似的问题,答案没有涵盖列替换: Spark column string replace when present in other column (row)
答案 0 :(得分:2)
问题在于$
在正则表达式中具有特殊含义,这意味着匹配行尾。所以你的代码:
regexp_replace(long_text, '$NUMBER', number)
正在尝试匹配模式:行尾,然后是文字字符串NUMBER
(永远不会匹配任何内容)。
为了匹配$
(或任何其他正则表达式特殊字符),您必须使用\
对其进行转义。
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '\$NUMBER', number)"))
df.show()
#+--------------------+------+
#| long_text|number|
#+--------------------+------+
#|hahaha the 3 is good| 3|
#| what is 5 doing?| 5|
#| ajajaj 2| 2|
#+--------------------+------+
答案 1 :(得分:0)
您必须将数字列转换为带有str()的字符串,然后才能在lambda中使用replace:
from pyspark.sql import types as T
from pyspark.sql import functions as F
l = [( 'hahaha the $NUMBER is good', 3)
,('what is $NUMBER doing?' , 5)
,('ajajaj $NUMBER ' , 2)]
df = spark.createDataFrame(l,['long_text','number'])
#Just added str() to your function
replace_udf = F.udf(lambda long_text, number: long_text.replace("$NUMBER", str(number)), T.StringType())
df.withColumn('long_text',replace_udf(F.col('long_text'),F.col('number'))).show()
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 2 | 2|
+--------------------+------+