Question

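My strings contain page numbers in the format "Page 2". I want to remove these page numbers.

A string might be:

"The first is Page 10 and then Page 1 and then Page 12"

Current implementation:

Is there a more elegant way to remove all "Page #{some_number}" occurrences than what I have below?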

page_numbers = [
    'Page 1', 
    'Page 2', 
    'Page 3', 
    'Page 4', 
    'Page 5', 
    'Page 6', 
    'Page 7', 
    'Page 8', 
    'Page 9',
    'Page 10',
    'Page 11',
    'Page 12']

x = "The first is Page 10 and then Page 1 and then Page 12"

for v in page_numbers:
    x = x.replace(v, ' ')

print(x)

Answer 1

This should do it, using the re module:

>>> import re
>>> x = "The first is Page 10 and then Page 1 and then Page 12"
>>> re.sub(r'(\s?Page \d{1,3})', ' ', x)
'The first is  and then  and then '

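re.sub replaces every match of the regular expression (the first argument) with the replacement string (the second argument) in x (the third argument).

So, what is the regular expression doing?

\s? consumes one whitespace character before the "Page n" text, if one is there
Page (followed by a space) matches the literal string "Page " exactly, including the trailing space
\d{1,3} matches 1 to 3 digits. If you only deal with page numbers up to 99, use \d{1,2}. If you need more digits, adjust accordingly.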
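
As a rough sketch of the digit-count adjustment described in the last point, the pattern can be built with a configurable upper bound. The helper name strip_page_numbers and its max_digits parameter are made up for illustration; they are not part of the re module.

```python
import re

def strip_page_numbers(text, max_digits=3):
    # Hypothetical helper: same pattern as above, with the digit limit configurable.
    pattern = re.compile(r'\s?Page \d{1,%d}' % max_digits)
    return pattern.sub(' ', text)

x = "The first is Page 10 and then Page 1 and then Page 12"
print(repr(strip_page_numbers(x)))

# A limit that is too small leaves the extra digits behind.
print(repr(strip_page_numbers("See Page 1234", max_digits=2)))  # 'See 34'
print(repr(strip_page_numbers("See Page 1234", max_digits=4)))  # 'See '
```

If you do not know the largest page number in advance, \d+ (one or more digits) avoids the leftover-digit problem entirely.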

Answer 2

You can use a regular expression to do this:

import re

x ="The first is Page 10 and then Page 1 and then Page 12"
print(re.sub(r'Page \d+', '', x))

This finds every "Page" followed by a space and one or more digits, and replaces it with an empty string.

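If you want to keep the spacing between words consistent, do this instead:

re.sub(r'Page\s\d+\s?', '', x)

This also matches the whitespace after the number and removes it; otherwise you would be left with two spaces (one from before "Page" and one from after the number). The trailing \s is optional (\s?) so that a page number at the very end of the string, such as "Page 12" here, is still matched.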
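
Another way to deal with the leftover spaces, offered here only as a sketch and not as part of the original answer, is to remove the page numbers first and then normalize whitespace in a second pass. The variable names are illustrative.

```python
import re

x = "The first is Page 10 and then Page 1 and then Page 12"

# Step 1: remove every "Page <digits>" occurrence; surrounding spaces stay put.
without_pages = re.sub(r'Page \d+', '', x)

# Step 2: collapse each run of whitespace into one space and trim both ends.
cleaned = re.sub(r'\s+', ' ', without_pages).strip()

print(repr(without_pages))  # 'The first is  and then  and then '
print(repr(cleaned))        # 'The first is and then and then'
```

Note that the second pass also collapses any other repeated whitespace in the string, which may or may not be what you want.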

Answer 3

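The re.sub answers are correct, but incomplete. If you only want to remove certain page numbers, a plain re.sub on its own is not enough. You need to supply a callback for that to work.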

p_set = set(page_numbers)

def replace(m):
    p = m.group()
    return ' ' if p in p_set else p

Now, pass replace as the callback to re.sub:

>>> re.sub(r'Page \d+', replace, x)
'The first is   and then   and then  '

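The second argument of re.sub accepts a callback, which is invoked each time a match is found. The corresponding match object is passed to replace as its argument, and replace should return the replacement string.

I also converted page_numbers to a set. This lets me perform constant-time lookups on p_set when deciding whether to keep or discard a matched string.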
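
To make the selective behavior visible, here is a small self-contained sketch in which only some pages are removed. The set pages_to_remove is an illustrative subset chosen for this example, not the full list from the question.

```python
import re

# Illustrative subset: only these exact page labels are removed.
pages_to_remove = {'Page 1', 'Page 12'}

def replace(m):
    p = m.group()  # the full matched text, e.g. 'Page 10'
    return ' ' if p in pages_to_remove else p

x = "The first is Page 10 and then Page 1 and then Page 12"
result = re.sub(r'Page \d+', replace, x)
print(repr(result))  # 'The first is Page 10 and then   and then  '
```

Because \d+ matches the whole number, "Page 10" is looked up as a unit and kept, whereas a plain str.replace of 'Page 1' would also eat the start of "Page 10".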

For more flexibility, you can support removing page numbers that fall within a range:

def replace(m):
    return ' ' if int(m.group(1)) in range(1, 13) else m.group()

And call it accordingly:

>>> re.sub(r'Page (\d+)', replace, x)
'The first is   and then   and then  '

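Assuming the range you want to remove is contiguous, this is more efficient than maintaining a list or set of page numbers. Note also that in Python 3 a membership check with the in operator on a range object is computationally cheap (constant time for integers).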
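
A minimal runnable sketch of the range-based variant, using a different sample string so that a page outside the range is visibly kept. The name pages and the sample text are assumptions made for this example.

```python
import re

pages = range(1, 13)  # pages 1 through 12 are removed

def replace(m):
    # group(1) is the captured number; group() is the whole match.
    return ' ' if int(m.group(1)) in pages else m.group()

x = "Keep Page 40 but drop Page 3 and Page 12"
result = re.sub(r'Page (\d+)', replace, x)
print(repr(result))  # 'Keep Page 40 but drop   and  '

# Membership on a range is computed arithmetically, without iterating,
# so even a very large range is checked immediately.
print(10**12 in range(1, 10**15))  # True
```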
