Pyspark: replace a string in a Spark dataframe column using a value from another column

Date: 2018-02-20 00:55:47

Tags: python pyspark pyspark-sql

I want to replace a value present in one column by building the search string from another column.

Input:

id  address      st
1   2.PA1234.la  1234
2   10.PA125.la  125
3   2.PA156.ln   156

Desired output:

id  address      st
1   2.PA9999.la  1234
2   10.PA9999.la 125
3   2.PA9999.ln  156
I tried:

df.withColumn("address", regexp_replace("address", "PA" + st, "PA9999"))
df.withColumn("address", regexp_replace("address", "PA" + df.st, "PA9999"))

Both of them seem to fail with:

TypeError: 'Column' object is not callable

Possibly similar to: Pyspark replace strings in Spark dataframe column

1 answer:

Answer 0: (score: 0)

You can also use a Spark UDF.

This solution can be applied whenever you need to modify dataframe entries using values from another column:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

pd_input = pd.DataFrame({'address': ['2.PA1234.la', '10.PA125.la', '2.PA156.ln'],
                         'st': ['1234', '125', '156']})

spark_df = spark.createDataFrame(pd_input)

# UDF that replaces the value of `st` inside `address` with '9999'
replace_udf = udf(lambda address, st: address.replace(st, '9999'), StringType())

spark_df.withColumn('adress_new', replace_udf(col('address'), col('st'))).show()

Output:

+-----------+----+------------+
|    address|  st|  adress_new|
+-----------+----+------------+
|2.PA1234.la|1234| 2.PA9999.la|
|10.PA125.la| 125|10.PA9999.la|
| 2.PA156.ln| 156| 2.PA9999.ln|
+-----------+----+------------+