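I need to remove single quotes from a string. The column is named Keywords, and it holds an array hidden inside a string. So I need to use a regex in a Spark DataFrame to remove the single quotes from the start and end of the string. The string looks like this: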
Keywords='["shade perennials"," shade loving perennials"," perennial plants"," perennials"," perennial flowers"," perennial plants for shade"," full shade perennials"]'
I tried the following:
from pyspark.sql.functions import udf
remove_single_quote = udf(lambda x: x.replace(u"'",""))
cleaned_df = spark_df.withColumn('Keywords', remove_single_quote('Keywords'))
But the single quotes are still there. I also tried (u"\'","").
Answer 0 (score: 1)
from pyspark.sql.functions import regexp_replace
new_df = data.withColumn('Keywords', regexp_replace('Keywords', "\'", ""))
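Note that the pattern "\'" matches every single quote in the value, including any apostrophe inside a keyword. Since the question only asks about the quotes at the start and end, an anchored pattern may be a closer fit. Here is a minimal sketch in plain Python (no Spark needed to run it); the pattern `^'|'$`, the name `strip_edge_quotes` and the sample value are my own assumptions for illustration, not something from the question.

```python
import re

# Assumed pattern: a single quote at the very start OR at the very end.
EDGE_QUOTES = r"^'|'$"

def strip_edge_quotes(value):
    """Remove one leading and one trailing single quote; inner quotes are kept."""
    return re.sub(EDGE_QUOTES, "", value)

# Hypothetical sample with an apostrophe inside one keyword.
sample = "'[\"shade perennials\",\" it's shady\"]'"
print(strip_edge_quotes(sample))  # ["shade perennials"," it's shady"]
```

The same pattern string should be usable with regexp_replace, since Java regex supports the same anchors and alternation: regexp_replace('Keywords', "^'|'$", "").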
Answer 1 (score: 0)
Try regexp_replace:
from pyspark.sql.functions import regexp_replace,col
cleaned_df = spark_df.withColumn('Keywords', regexp_replace('Keywords',"\'",""))
OR
from pyspark.sql import functions as f
cleaned_df = spark_df.withColumn('Keywords', f.regexp_replace('Keywords',"\'",""))
I haven't tested this, but it should work.
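import ast
from pyspark.sql import functions as f, types as t
parse_keywords = f.udf(lambda s: ast.literal_eval(s.strip("'")), t.ArrayType(t.StringType()))  # parse per row
cleaned_df = spark_df.withColumn('Keywords', parse_keywords('Keywords'))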
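If the goal is a real array rather than a cleaned-up string, the per-row logic can be sketched in plain Python first. This assumes the text between the outer single quotes is a valid JSON array, which the sample in the question appears to be. The helper name `parse_keywords` and the whitespace trimming of each keyword are my own additions, so drop the trimming if the leading spaces matter.

```python
import json

def parse_keywords(raw):
    """Strip the outer single quotes, parse the JSON array, trim each keyword."""
    if raw is None:
        return None  # pass missing values through unchanged
    inner = raw.strip().strip("'")
    return [kw.strip() for kw in json.loads(inner)]

# Shortened version of the value from the question.
sample = "'[\"shade perennials\",\" shade loving perennials\",\" perennials\"]'"
print(parse_keywords(sample))
# ['shade perennials', 'shade loving perennials', 'perennials']
```

To apply it to the DataFrame, a function like this could be wrapped as udf(parse_keywords, ArrayType(StringType())) and passed to withColumn. I have not benchmarked that against a pure regexp_replace approach.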