Question

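I want to remove from a string in Python any text that matches something like "\nPage 10 of 12\n", where 10 and 12 will always be different numbers (I am looping through 300+ documents, which all have different page counts). Below is an example of some text in my string, followed by the output I would like: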

thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n\

Output -> thisisaboutthennowwearegoing

I am trying this code:

page = r'\nPage \b\d+\b of \b\d+\b\n+'
return re.sub(page, '', string)

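But I can't get it to work. I tried referring to this link, Python: Extract numbers from a string, for help, but I can't seem to combine the numbers and the letters.

I'm new to regex in Python, so any help would be great. I can get a regular expression to work when it's only letters or only numbers, but I run into problems when combining them.

Thanks in advance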

Answer 1

One way could be:

import re

string = """thisisaboutthen


Page 2 of 12

nowwearegoing

Page 3 of 12



"""

string = re.sub(r'\s*Page \d+ of \d+\s*', '', string)
print(string)

which yields

thisisaboutthennowwearegoing

See a demo on regex101.com.
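To see why the pattern in the question falls short, here is a small sketch that runs both patterns on the sample text. The variable names are only illustrative. The question's pattern consumes a single newline before "Page", so any extra newlines in front of the marker are left behind, whereas \s* swallows all of the surrounding whitespace.

```python
import re

text = 'thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n'

# Pattern from the question: matches only ONE newline before "Page".
original = r'\nPage \b\d+\b of \b\d+\b\n+'
# Pattern from this answer: \s* matches any run of whitespace on both sides.
fixed = r'\s*Page \d+ of \d+\s*'

leftover = re.sub(original, '', text)
cleaned = re.sub(fixed, '', text)

print(repr(leftover))  # 'thisisaboutthen\n\nnowwearegoing\n'
print(repr(cleaned))   # 'thisisaboutthennowwearegoing'
```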

Answer 2

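I'm not sure about the context, but you can use \s instead of spelling out newlines (\n) and spaces separately. With +, you tell the regex to match one or more of the preceding item.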

import re
string = 'thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n'
pattern = r'\s+Page\s+\d+\s+of\s+\d+\s+'
print(re.sub(pattern, '', string))

\d matches digits and \s matches whitespace characters (space, \t, \n, \r, \f, \v). re.IGNORECASE may also be useful.
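Since the question mentions looping over 300+ documents, one possible arrangement is to compile the pattern once and reuse it. This is only a sketch: the names PAGE_MARKER and strip_page_markers and the sample documents are made up for illustration, and \s* is used at both ends (rather than \s+) so that a marker at the very start or end of a document is also removed.

```python
import re

# Compile once and reuse; IGNORECASE also matches e.g. "page 10 OF 120".
PAGE_MARKER = re.compile(r'\s*Page\s+\d+\s+of\s+\d+\s*', re.IGNORECASE)

def strip_page_markers(text):
    """Remove 'Page X of Y' markers together with the surrounding whitespace."""
    return PAGE_MARKER.sub('', text)

# Hypothetical stand-ins for the real documents.
documents = [
    'intro\n\nPage 1 of 3\n\nbody',
    'first\npage 10 OF 120\n\nsecond\n\nPAGE 11 of 120\n',
]

cleaned = [strip_page_markers(doc) for doc in documents]
print(cleaned)  # ['introbody', 'firstsecond']
```

Note that removing the whitespace on both sides joins the neighbouring words, which is what the question asks for. If the words should stay separated, pass ' ' or '\n' as the replacement instead of ''.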

Pulling numbers and letters together in a Python regex