Question

I have a string in Python:

string = "Hello I am a 21 !string. In section 3.2.F.1.2 we covered 1topic X. On the oth1er hand, in section 1.2.F.1.1 we covered Y. Lastly, in section 1.1.F.3.2 we 23 covered Z."

I need to remove the stray numbers and punctuation from the text, so that:

"... a 21 !string ..." --------> "... a string ..." and...

"covered 1topic X." ---------> "covered topic x"

My final string should be:

filtered = "hello i am a string in section 3.2.F.1.2 we covered topic x on the other hand in section 1.2.F.1.1 we covered y lastly in section 1.1.F.3.2 we covered z"

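...so that the codes "3.2.F.1.2", "1.2.F.1.1" and "1.1.F.3.2" are left untouched.

I was able to write regular expressions that describe the codes and the characters to remove: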

regex_codes = r"[\d\.]{1,4}F[\.\d]{1,4}"

all_nums_punct = r"""[0-9 _.,!"'/$]*"""

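What I cannot figure out is how to "select and remove all numbers and punctuation (all_nums_punct) except where they are part of one of these codes (regex_codes)".

I tried a negative lookahead pattern, based on an earlier Stack Overflow post, to ignore anything that starts with one of the codes, but my pattern did not select anything.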

Answer 1

Use the regex package from the PyPI repository:

import regex

string = "Hello I am a 21 !string. In section 3.2.F.1.2 we covered 1topic X. On the oth1er hand, in section 1.2.F.1.1 we covered Y. Lastly, in section 1.1.F.3.2 we 23 covered Z."
string = regex.sub(r'''[\d\.]{1,4}F[\.\d]{1,4}(*SKIP)(*FAIL)|[0-9_.,!"'/$]''', '', string)
print(string)

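Prints:

Hello I am a  string In section 3.2.F.1.2 we covered topic X On the other hand in section 1.2.F.1.1 we covered Y Lastly in section 1.1.F.3.2 we  covered Z

The pattern matches either your regex_codes expression or a single character from your all_nums_punct class (with the space removed from the class). When the regex_codes expression matches, (*SKIP)(*FAIL) makes the engine skip past those characters and fail that attempt, so the code is never replaced; matching then resumes after the code, where the second alternative can match the characters to delete.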
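One way to see which branch of the alternation fires is to list the matches. This is only a sketch: the standard-library re module has no (*SKIP)(*FAIL), so the code part is wrapped in a capture group instead, and the names PATTERN, sample and matches are made up for the illustration.

```python
import re

# Codes go in capture group 1; the second branch is a single character to delete.
PATTERN = re.compile(r"""([\d.]{1,4}F[.\d]{1,4})|[0-9_.,!"'/$]""")

sample = "in section 3.2.F.1.2 we covered 1topic X."

# For every match, record the matched text and whether the code branch matched.
matches = [(m.group(0), m.group(1) is not None) for m in PATTERN.finditer(sample)]

for text, is_code in matches:
    print(repr(text), "keep (code)" if is_code else "delete")
```

Because the code branch is tried first at every position, the digits and dots inside a code are consumed as part of the code and never reach the single-character branch.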

The result can contain runs of consecutive space characters. You will need a second substitution to replace each run with a single space:

import regex

string = "Hello I am a 21 !string. In section 3.2.F.1.2 we covered 1topic X. On the oth1er hand, in section 1.2.F.1.1 we covered Y. Lastly, in section 1.1.F.3.2 we 23 covered Z."
string = regex.sub(r'''[\d\.]{1,4}F[\.\d]{1,4}(*SKIP)(*FAIL)|[0-9_.,!"'/$]''', '', string)
string = regex.sub(r' +', ' ', string)
print(string)

Prints:

Hello I am a string In section 3.2.F.1.2 we covered topic X On the other hand in section 1.2.F.1.1 we covered Y Lastly in section 1.1.F.3.2 we covered Z

Update

I will try to answer the question you asked @WiktorStribiżew about how his solution below works:

import re
string = re.sub(r"""([.\d]{1,4}F[.\d]{1,4})|[0-9_.,!"'/$]""", r'\1', string)

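Whatever the regex matches is replaced by \1, which stands for the value of capture group 1. If the regex matches regex_codes, capture group 1 is set to the matched text, so the match is replaced by itself and nothing changes. If instead the regex matches one of the characters you want to remove, capture group 1 does not take part in the match and the matched character is replaced by the empty string (in Python 3.5 and later; earlier versions raise an error for an unmatched group). This approach does not need the regex package. It also leaves consecutive spaces behind, which you may want to collapse as I showed above.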
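Putting the pieces together with only the standard library might look like the sketch below. It goes one step beyond the answer: the question's expected output is lowercase outside the codes, so a third alternative is added for uppercase letters and a replacement function is used instead of \1. The names PATTERN, _replace and clean are hypothetical.

```python
import re

# Tried in order at each position: a code (kept as is), a character to
# delete, or a run of uppercase letters (lowercased).
PATTERN = re.compile(
    r"""(?P<code>[.\d]{1,4}F[.\d]{1,4})|(?P<junk>[0-9_.,!"'/$])|(?P<upper>[A-Z]+)"""
)

def _replace(match):
    if match.group("code"):
        return match.group("code")           # leave section codes untouched
    if match.group("upper"):
        return match.group("upper").lower()  # lowercase ordinary text
    return ""                                # drop stray digits and punctuation

def clean(text):
    text = PATTERN.sub(_replace, text)
    # Removing characters can leave runs of spaces; collapse them.
    return re.sub(r" +", " ", text).strip()

string = "Hello I am a 21 !string. In section 3.2.F.1.2 we covered 1topic X. On the oth1er hand, in section 1.2.F.1.1 we covered Y. Lastly, in section 1.1.F.3.2 we 23 covered Z."
result = clean(string)
print(result)
```

The "F" inside each code stays uppercase because the code alternative is tried before the uppercase one.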
