Using a regular expression

Date: 2019-12-16 14:31:07

Tags: regex apache-spark pyspark

I need to remove the single quotes from a string. The column is named Keywords, and it holds an array hidden inside a string. So I need a regex to remove the single quotes from the beginning and end of the string in a Spark DataFrame. The string looks like this:

Keywords=
'
  [
      "shade perennials"," shade loving perennials"," perennial plants"," perennials"," perennial flowers"," perennial plants for shade"," full shade perennials"
  ]
'
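
For reference, a minimal DataFrame that reproduces this column could be built as follows (a sketch; the keyword list is abbreviated, and the names spark and spark_df are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row whose Keywords value is the quoted, multi-line array string shown above (abbreviated)
raw_value = "'\n  [\n      \"shade perennials\",\" shade loving perennials\",\" perennials\"\n  ]\n'"
spark_df = spark.createDataFrame([(raw_value,)], ['Keywords'])
spark_df.show(truncate=False)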

I tried the following:

from pyspark.sql.functions import udf

remove_single_quote = udf(lambda x: x.replace(u"'", ""))
cleaned_df = spark_df.withColumn('Keywords', remove_single_quote('Keywords'))

But the single quotes are still there. I also tried (u"\'", "").
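
A quick way to see exactly what the column holds before and after the replacement is to print the values without truncation (a sketch):

# Show the raw and the supposedly cleaned values in full
spark_df.select('Keywords').show(truncate=False)
cleaned_df.select('Keywords').show(truncate=False)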

2 answers:

Answer 0 (score: 1)

from pyspark.sql.functions import regexp_replace

# Replace every single quote in the Keywords column with an empty string
new_df = data.withColumn('Keywords', regexp_replace('Keywords', "\'", ""))
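
For what it's worth, the same result is possible without a regular expression by using translate, which deletes every occurrence of the listed characters (a sketch against the same data DataFrame):

from pyspark.sql.functions import translate

# translate drops the single quote character wherever it appears; no regex needed
new_df = data.withColumn('Keywords', translate('Keywords', "'", ""))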

Answer 1 (score: 0)

Try regexp_replace:

from pyspark.sql.functions import regexp_replace, col

cleaned_df = spark_df.withColumn('Keywords', regexp_replace('Keywords', "\'", ""))

OR

from pyspark.sql import functions as f

cleaned_df = spark_df.withColumn('Keywords', f.regexp_replace('Keywords', "\'", ""))
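
Since the question asks to strip the quotes only from the beginning and end of the string, an anchored pattern is another option (a sketch, untested):

from pyspark.sql.functions import regexp_replace

# Remove a single quote (plus surrounding whitespace) only at the start or end of the value
cleaned_df = spark_df.withColumn('Keywords', regexp_replace('Keywords', r"^\s*'|'\s*$", ""))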

I haven't tested this, but it should work:

import ast
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
# literal_eval must run per row inside a UDF; assumes each value is a valid Python list literal
parse_keywords = udf(lambda s: ast.literal_eval(s), ArrayType(StringType()))
cleaned_df = spark_df.withColumn('Keywords', parse_keywords('Keywords'))
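
If the end goal is an actual array column rather than a cleaned string, a UDF-free alternative (a sketch, assuming the text between the quotes is valid JSON) is to strip the outer quotes and parse the rest with from_json:

from pyspark.sql.functions import regexp_replace, from_json
from pyspark.sql.types import ArrayType, StringType

# Drop the outer single quotes, then parse the remaining JSON array into array<string>
array_df = (spark_df
            .withColumn('Keywords', regexp_replace('Keywords', r"^\s*'|'\s*$", ""))
            .withColumn('Keywords', from_json('Keywords', ArrayType(StringType()))))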
