Question

I need to clean up some text, as shown in the code below:

import re
def clean_text(text):
    text = text.lower()
    # replacement function
    text = re.sub(r"i'm","i am",text)
    text = re.sub(r"she's","she is",text)
    text = re.sub(r"can't","cannot",text)
    text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)
    return text

clean_questions= []
for question in questions: 
    clean_questions.append(clean_text(question))

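This code is supposed to clean my questions list, but my clean_questions list is empty. I reopened Spyder and the list was full but not cleaned; then I reopened it again and it was empty. The console error says: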

In [10] :clean_questions= [] 
   ...: for question in questions: 
   ...: clean_questions.append(clean_text(question))
Traceback (most recent call last):

  File "<ipython-input-6-d1c7ac95a43f>", line 3, in <module>
    clean_questions.append(clean_text(question))

  File "<ipython-input-5-8f5da8f003ac>", line 16, in clean_text
    text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)

  File "C:\Users\hp\Anaconda3\lib\re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)

  File "C:\Users\hp\Anaconda3\lib\re.py", line 286, in _compile
   p = sre_compile.compile(pattern, flags)

  File "C:\Users\hp\Anaconda3\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 580, in _parse
    raise source.error(msg, len(this) + 1 + len(that))

error: bad character range }-=

I am using Python 3.6, specifically the Anaconda build Anaconda3-2018.12-Windows-x86_64.

Answer 1

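Your character class (as the traceback shows) is invalid: } comes after = in ordinal value (} is 125, = is 61), and the - between them means the pattern is trying to match any character from the ordinal of } through the ordinal of =. Character ranges must run from a low ordinal to a high one, so 125 -> 61 is nonsensical, hence the error.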
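The failure can be reproduced in isolation. This is a minimal sketch; the pattern `[}-=]` is a reduced version of the one in the question, and `error_message` is just an illustrative name.

```python
import re

# "}" has a higher ordinal than "=", so "}-=" is a reversed range.
print(ord("}"), ord("="))  # 125 61

try:
    re.compile(r"[}-=]")
    error_message = None
except re.error as exc:
    # The exception text names the offending range.
    error_message = str(exc)

print(error_message)
```

The exact wording of the message may vary slightly between Python versions, but it should mention a bad character range.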

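In a way, you got lucky. Had the characters around the - been reversed, e.g. =-}, you would have silently removed every character with an ordinal from 61 to 125, which includes an assortment of punctuation as well as all the standard ASCII letters (upper and lower case).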
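To see how quietly destructive the reversed form would be, here is a small sketch; the sample string is made up for illustration.

```python
import re

sample = "a=b, 123!"
# "=-}" is a valid range covering ordinals 61 through 125,
# which includes "=", "?", "@", A-Z, a-z and several other symbols.
stripped = re.sub(r"[=-}]", "", sample)
print(repr(stripped))  # ', 123!'
```

No error is raised, and every letter disappears along with the equals sign.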

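You can fix this by removing the second - from the character class (you already include it at the start of the class, where it does not need escaping), changing from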

text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]", "", text)

to

text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", "", text)
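As a quick check, the function from the question with only that one character removed might look like the sketch below. The sample `questions` list is hypothetical, since the original data is not shown.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"can't", "cannot", text)
    # Only one "-" remains, at the start of the class, where it is literal.
    text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", "", text)
    return text

questions = ["I'm sure she's right; can't fail?"]  # hypothetical sample data
clean_questions = [clean_text(question) for question in questions]
print(clean_questions)  # ['i am sure she is right cannot fail']
```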

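That said, I would suggest dropping the regex here. With this much punctuation the risk of a mistake is high, and there are other approaches that do not involve regular expressions at all, work just as well, and do not leave you worrying about whether you escaped everything that matters. (The alternative is over-escaping, which makes the regex unreadable and is still error-prone.)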

Instead, replace that line with a simple str.translate call. First, outside the function, make a translation table of the things to remove:

# The redundant - is harmless here since the result is a dict which dedupes anyway.
# A plain (non-raw) string is used so that no backslash ends up in the table.
killpunctuation = str.maketrans('', '', '-()"#/@;:<>{}-=~|.?,')

Then replace the line:

text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)

with:

text = text.translate(killpunctuation)
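Put together, the whole function might look like the following sketch. The punctuation set is written here as a plain single-quoted string (a choice made for this sketch) so that no escaping is needed and no backslash ends up in the table.

```python
import re

# Built once at module level, outside the function.
killpunctuation = str.maketrans('', '', '-()"#/@;:<>{}=~|.?,')

def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"can't", "cannot", text)
    # Delete every character listed in the translation table.
    return text.translate(killpunctuation)

print(repr(clean_text("She's here - (maybe)?")))  # 'she is here  maybe'
```

Note that removing the dash leaves the two spaces that surrounded it; collapsing whitespace would be a separate step.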

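It should run at least as fast as the regex (probably faster), and it is much less error-prone, because no character has a special meaning. A translation table is just a mapping from Unicode ordinals to None (meaning delete), to another ordinal (meaning a single-character replacement), or to a string (meaning a char -> multichar replacement); it has no concept of special escapes. If the goal is to remove all ASCII punctuation, you are better off defining the translation table with the string module constant, which also makes the code more self-documenting, so nobody has to wonder whether you meant to remove all punctuation or only some of it, and whether that was intentional: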

import string
killpunctuation = str.maketrans('', '', string.punctuation)
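The three kinds of table values mentioned above can be seen in a hand-built table. This is an illustrative sketch only; the `table` mapping and the sample string are made up.

```python
table = {
    ord("-"): None,      # None: delete the character
    ord("a"): ord("b"),  # ordinal: replace with a single character
    ord("&"): " and ",   # string: replace with several characters
}
result = "a-b&c".translate(table)
print(result)  # bb and c

# str.maketrans with a third argument builds the "delete" form for you.
print(str.maketrans("", "", "-"))  # {45: None}
```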

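As it happens, your existing string does not remove all punctuation (among others, it misses ^, ! and $), so this change may not be correct; but if it is, definitely make it. If the set is meant to be a subset of punctuation, you should add a comment explaining how the characters were chosen, so that maintainers do not wonder whether you made a mistake.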
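One way to see exactly which ASCII punctuation characters the hand-written set leaves out is a set difference against string.punctuation. This is a small sketch; `removed` and `missing` are illustrative names.

```python
import string

# The characters the question's pattern is meant to remove.
removed = set('-()"#/@;:<>{}=~|.?,')

# Everything in string.punctuation that the hand-written set does not cover.
missing = sorted(set(string.punctuation) - removed)
print("".join(missing))
print(len(missing), "of", len(string.punctuation), "punctuation characters are kept")
```

The result includes ^, ! and $, along with several others such as the apostrophe and the underscore.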

Answer 2

Use it the way I do:

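import re
import string

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # drop text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>+', '', text)  # drop HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop ASCII punctuation
    text = re.sub(r'\n', '', text)  # drop newlines
    text = re.sub(r'\w*\d\w*', '', text)  # drop words containing digits
    return text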

Answer 3

You need to escape the special characters properly and enclose them in square brackets:

re.sub(r'[-\(\)\"#\/@;:<>\{\}\-=~|\.\?]', '', some_text)

A more general regex for special characters (i.e. anything other than a letter or a digit) is

[^a-zA-Z0-9]
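A short sketch of how that negated class behaves; the sample string is made up. Note that it also strips whitespace and any non-ASCII letters, so a space is often added to the class.

```python
import re

sample = "Hello, World! 123"

# Remove everything that is not an ASCII letter or digit (spaces included).
only_alnum = re.sub(r"[^a-zA-Z0-9]", "", sample)
print(only_alnum)  # HelloWorld123

# Variant that keeps spaces by adding one to the class.
keep_spaces = re.sub(r"[^a-zA-Z0-9 ]", "", sample)
print(keep_spaces)  # Hello World 123
```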

Cleaning text with Python and re

3 answers: