Question

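I am trying to build a regex pattern to remove punctuation from a string. I decided to use punctuation from the string library. However, when I run it, Spark returns an error about an unclosed character class.

I suspect that a character in punctuation is closing the quotes during execution. I feel this should be easy to fix, but I am not sure how. My code is below: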

from pyspark.sql.functions import regexp_replace, trim, col, lower
import string

def removePunctuation(column):

    no_punct = regexp_replace(column, string.punctuation, '')
    lowered = lower(no_punct)
    cleaned = trim(lowered)
    return cleaned

I get this error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 86.0 failed 1 times, most recent failure: Lost task 0.0 in stage 86.0 (TID 3709, localhost): java.util.regex.PatternSyntaxException: Unclosed character class near index 31

Answer 1

Short and simple:

regexp_replace(column, r"\p{Punct}", "")

To use string.punctuation, you have to escape the individual characters and put them inside a character class, but that is error-prone and ugly:

import re

regexp_replace(column, "[{0}]".format(re.escape(string.punctuation)), "")
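
To see why the escaping matters, here is a minimal sketch that checks both patterns locally with Python's re module. This is only a stand-in: Spark evaluates the pattern with Java's regex engine, so the exact error text differs, but the cause is the same (the unescaped "[" in string.punctuation opens a character class that never closes). The helper name compiles is made up for this illustration.

```python
import re
import string

def compiles(pattern):
    """Return True if the pattern is a valid regex for Python's re module."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The raw punctuation string contains "[" followed by an escaped "]",
# so the character class it opens is never closed.
raw_pattern = string.punctuation

# Escaping every special character and wrapping the result in [...]
# turns the whole thing into one well-formed character class.
escaped_pattern = "[{0}]".format(re.escape(string.punctuation))

raw_ok = compiles(raw_pattern)
escaped_ok = compiles(escaped_pattern)
cleaned = re.sub(escaped_pattern, "", "Hello, World! (test)")
```

The same escaped_pattern string is what gets passed to regexp_replace in the snippet above.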

Answer 2

column = regexp_replace(column, r'[^\w\s]', '')
column = regexp_replace(column, '_', '')

Note that the underscore counts as a legitimate word character (it is matched by \w), so it has to be removed separately.
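
A quick local check of the underscore behaviour, again using Python's re module as an approximation of Spark's Java regex engine. The re.ASCII flag is my own addition, used on the assumption that it mirrors Java's default ASCII-only \w more closely; the sample string is invented.

```python
import re

sample = "snake_case, with punctuation!"

# Step 1: drop everything that is neither a word character nor whitespace.
# The underscore survives because \w matches it.
step1 = re.sub(r"[^\w\s]", "", sample, flags=re.ASCII)

# Step 2: remove the underscore explicitly.
step2 = re.sub("_", "", step1)
```

After step 1 the comma and the exclamation mark are gone but the underscore remains; step 2 is what removes it.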

Answer 3

You could instead keep only what you want: digits, letters, and whitespace

return lower(trim(regexp_replace(regexp_replace(column, r'[^\w\s]', ''), '_', '')))
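
If you would rather make a single pass, the two replacements can probably be folded into one alternation, '[^\w\s]|_'. The sketch below only imitates the lower(trim(regexp_replace(...))) chain with plain Python string methods so it can be run without Spark; the function name remove_punctuation_local is hypothetical and not part of any library.

```python
import re

def remove_punctuation_local(text):
    """Local imitation of lower(trim(regexp_replace(col, pattern, '')))."""
    # One pattern: anything that is not a word character or whitespace,
    # or a literal underscore.
    no_punct = re.sub(r"[^\w\s]|_", "", text, flags=re.ASCII)
    # strip() removes leading and trailing whitespace, lower() lowercases.
    return no_punct.strip().lower()

result = remove_punctuation_local("  Hello_World, 42!  ")
```

In Spark the same pattern string would go straight into regexp_replace, assuming Java's regex engine treats this alternation the same way.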

Unclosed character class using punctuation in Spark
