Question

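I am writing a regular expression to select the 30 characters that precede each number with more than 4 digits in the text below. Here is my code: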

import re

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

# Raw string so that \d is passed to the regex engine unchanged
reg = r".{0,30}(?:[\d]+[ .]?){5,}"
regc = re.compile(reg)
res = regc.findall(text)
print(res)

This gives only partial results.

I only get the 30 characters before 100000.

How can I get the 30 characters before 100001, and also the 30 characters before 100002?

Answer 1

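You are looking for the 30 characters (any character except a newline) that precede the number. (?=...) is a positive lookahead: it asserts that the number follows without including it in the match.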

/.{30}(?=100001)/g

https://regexr.com/4293v
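The pattern above is written in JavaScript-style /.../g notation. A minimal Python sketch of the same idea follows. The second pattern is an assumption added for illustration, not part of the answer: it wraps the whole expression in a lookahead so that matches consume no text and can overlap, and it uses (?<!\d) so a match only starts at the first digit of a number. It also assumes each number has at least 30 characters before it.

```python
import re

text = ("I went and I bought few tickets and ticket numbers 100000,100001 and "
        "100002.I bought them for 200,300 and 400 USD. Box office collections "
        "were 55555555 USD")

# Direct translation of /.{30}(?=100001)/g: the 30 characters before 100001.
single = re.findall(r'.{30}(?=100001)', text)

# Illustrative generalization (an assumption, not from the answer):
# the outer lookahead consumes nothing, so the 30-character windows may overlap.
# (?<!\d) stops the match from starting in the middle of a longer number.
all_contexts = re.findall(r'(?=(.{30})(?<!\d)\d{5,})', text)

print(single)
print(all_contexts)
```

Because the generalized pattern requires exactly 30 preceding characters, a number closer than 30 characters to the start of the text would be skipped.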

Answer 2

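Since you need overlapping matches, you need to use lookarounds. However, lookbehinds in re must be fixed-width, so you can use a trick: reverse the string, use a regex with a lookahead, and then reverse the matches: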

import re
rev_rx = r'((?:\d+[ .]?){5,})(?=(.{0,30}))'
text="I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"
results = [ "{}{}".format(y[::-1], x[::-1]) for x, y in re.findall(rev_rx, text[::-1]) ]
print(results)
# => ['D. Box office collections were 55555555', 'cket numbers 100000,100001 and 100002', 'ets and ticket numbers 100000,100001', 'few tickets and ticket numbers 100000']

See the Python demo.

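The ((?:\d+[ .]?){5,})(?=(.{0,30})) regex matches and captures into Group 1 five or more sequences of one or more digits, each followed by an optional space or period. Then the positive lookahead checks for the 0 to 30 characters that follow in the (reversed) string and captures that substring into Group 2. So all you need to do is concatenate the reversed Group 2 and Group 1 values to get the required matches.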
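The fixed-width restriction that motivates the reversal trick can be checked directly. The sketch below is only an illustration; the helper name compiles is hypothetical. It shows that re accepts a lookbehind of exactly 30 characters but rejects one of 0 to 30 characters.

```python
import re

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed width (exactly 30 characters): accepted by re.
fixed_ok = compiles(r'(?<=.{30})\d{5,}')

# Variable width (0 to 30 characters): rejected, look-behind must be fixed-width.
variable_ok = compiles(r'(?<=.{0,30})\d{5,}')

print(fixed_ok, variable_ok)
```

Even the accepted fixed-width form only asserts that 30 characters precede the number; it does not return them, which is why the answer captures the context with a lookahead on the reversed string instead.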

Answer 3

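You can get the 30 characters before any number with more than 4 digits by combining a simple regular expression with string methods, rather than using a more complex regex to both find the matches and capture the required characters.

The example below uses a regex to find all numbers with more than 4 digits, then uses str.find() to get the position of each match in the original text so that the preceding 30 characters can be sliced: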

import re

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

patt = re.compile(r'\d{5,}')
nums = patt.findall(text)
matches = [text[:text.find(n)][-30:] for n in nums]

print(matches)
# OUTPUT (shown on multiple lines for readability)
# [
#     'ew tickets and ticket numbers ',
#     'ets and ticket numbers 100000,',
#     'ket numbers 100000,100001 and ',
#     '. Box office collections were '
# ]

The regex finds every number with more than 4 digits, and the sliced contexts are free to overlap.
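One caveat: str.find() returns the position of the first occurrence, so if the same number appeared twice in the text, both entries would get the context of the first one. A possible variation, sketched here as an assumption rather than as part of the answer, takes the position from each match object via re.finditer(). The function name contexts_before and its width parameter are hypothetical.

```python
import re

def contexts_before(text, width=30):
    """Return up to `width` characters preceding each run of 5+ digits."""
    return [
        # m.start() is the position of this particular match,
        # so repeated numbers get their own context.
        text[max(0, m.start() - width):m.start()]
        for m in re.finditer(r'\d{5,}', text)
    ]

text = ("I went and I bought few tickets and ticket numbers 100000,100001 and "
        "100002.I bought them for 200,300 and 400 USD. Box office collections "
        "were 55555555 USD")

matches = contexts_before(text)
print(matches)
```

For the sample text this should produce the same four slices as the str.find() version, since no number is repeated there.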
