Question

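I want to search for specific words such as "income" or "revenue". To do that, I created a word list and search the text for those words.

However, my code returns no results for words that have punctuation attached, such as "income," or "revenue.". I now want to remove that punctuation without removing the dots inside numbers such as "2.4" or any other signs such as "%".

I have tried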

table = str.maketrans({key: None for key in string.punctuation})
text_wo_dots = text.translate(table)

and

text_wo_dots = re.sub(r'[^\w\s]',' ',text)

but both of these remove all punctuation.

Answer 1

I suggest you first split the text into individual words, with the punctuation still attached:

# text must be a str, not a list, for .split() to work
text = "This is an example, it contains 1.0 number and some words."
raw_list = text.split()

Now you can remove the punctuation at the end of each element.

cleaned_words = []
for word in raw_list:
    if word[-1] in ['.', ',', '!', '?']:
        cleaned_words.append(word[:-1])
    else:
        cleaned_words.append(word)

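Note 1: If your text also contains numbers written like 1. (meaning 1.0), you need to look at the second-to-last character as well and check whether isdigit() evaluates to True for it.
Note 2: If a word can end with several punctuation marks, run a while loop to remove them, and append the word only once no more punctuation is found.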
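
The two notes can be combined into one helper. This is only a sketch: the name strip_trailing_punctuation, the keep_numeric_dot flag and the PUNCTUATION set are made up for illustration, and keeping the dot after a digit is one possible reading of Note 1.

```python
PUNCTUATION = ".,!?"  # assumed set of marks to strip; extend as needed

def strip_trailing_punctuation(word, keep_numeric_dot=False):
    """Remove all trailing punctuation marks from a single token."""
    while word and word[-1] in PUNCTUATION:
        # Note 1 (assumed reading): keep a dot that directly follows a digit, as in "1."
        if keep_numeric_dot and word[-1] == "." and word[-2:-1].isdigit():
            break
        # Note 2: keep looping until no trailing punctuation is left
        word = word[:-1]
    return word

text = "Wait... it contains 1.0 number, and some words!?"
cleaned_words = [strip_trailing_punctuation(w) for w in text.split()]
print(cleaned_words)
# ['Wait', 'it', 'contains', '1.0', 'number', 'and', 'some', 'words']
```

With keep_numeric_dot=True a token such as "1." is left unchanged, but so is a number that merely ends a sentence, so the flag is off by default.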

Answer 2

Something as simple as this might also work:

[\.,:!?][\n\s]

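[\.,:!?] lists some punctuation marks, and you can add more as needed, while [\n\s] requires that the mark is followed by a space or a newline.

Here is a working example: https://regex101.com/r/TcR6Ct/2

And here is the Python code: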

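import re

s = 'Bla, bla, bla 7.6 bla.'

# A punctuation mark followed by whitespace; the raw string avoids invalid-escape warnings.
pattern = r'[.,:!?][\n\s]'
# Pad with a space so trailing punctuation matches; substitute a space so words stay separated.
s = re.sub(pattern, ' ', s + ' ').strip()
print(s)  # Bla bla bla 7.6 bla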
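
A variant of the same idea, offered as a sketch rather than as part of the original answer: a lookahead checks for the following whitespace (or the end of the string) without consuming it, so the string needs no padding and the spacing between words is left untouched.

```python
import re

# One or more punctuation marks, followed by whitespace or the end of the string.
# The lookahead (?=\s|$) only inspects what follows; it is not part of the match.
pattern = re.compile(r'[.,:!?]+(?=\s|$)')

s = 'Bla, bla, bla 7.6 bla.'
cleaned = pattern.sub('', s)
print(cleaned)  # Bla bla bla 7.6 bla
```

Like the pattern above, this leaves punctuation that is followed directly by a letter, as in "income,and", in place.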

Answer 3

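You can use a negative lookbehind (?<! and a negative lookahead (?! to assert that what is directly on the left and directly on the right is not a digit: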

(?<!\d)[^\w\s]+(?!\d)


For example:

import re
text = "income,and 4.6 test"
text_wo_dots = re.sub(r'(?<!\d)[^\w\s]+(?!\d)',' ',text)
print(text_wo_dots) # income and 4.6 test
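
To connect this back to the word-list search in the question, here is a minimal sketch. The helper find_words, the constant name and the sample sentence are hypothetical; only the regular expression comes from this answer.

```python
import re

# Runs of punctuation that are not adjacent to a digit (pattern from this answer).
PUNCT_NOT_IN_NUMBER = re.compile(r'(?<!\d)[^\w\s]+(?!\d)')

def find_words(text, wordlist):
    """Return the words from wordlist that occur as whole tokens in text."""
    cleaned = PUNCT_NOT_IN_NUMBER.sub(' ', text)
    tokens = {token.lower() for token in cleaned.split()}
    return [word for word in wordlist if word.lower() in tokens]

text = "Income, rose 2.4% while revenue. fell"
print(find_words(text, ["income", "revenue", "profit"]))
# ['income', 'revenue']
```

The token "2.4%" survives intact because both the dot and the percent sign sit next to a digit. For the same reason, a full stop that directly follows a number at the end of a sentence is also kept.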

How do I remove only punctuation such as "." and ","?
