Question

Suppose I have a text like this:

Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences.

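Let's say I only want to get the favorite numbers from this text. I have no way of knowing whether the text contains any favorite numbers unless the phrase favorite number is present, so I am basically trying to parse the numbers around the phrase favorite number (or favorite numbers). The expected result should look like this: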

['5', '6', '7', '0']

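I tried using regular expressions, but so far I have failed. What would be the most logical approach?

Edit: after reading @LouiseDavies's question, I am adding another example below: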

Alice has 2 favorite numbers: 11 and 12. Bob has 10 favorite numbers: 0, 100, 1264, 598, 78496, 33546, 1028896, 23, 48, 6.

So in this example, my output should look like this (order does not matter):

['11', '12', '0', '100', '1264', '598', '78496', '33546', '1028896', '23', '48', '6']

Answer 1

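You haven't shown any code, so I won't write out a complete solution.

You can split on '.', keep the sentences that contain "favourite number", and extract the numbers from those sentences. You shouldn't try to write a single regex for the whole sentence.

Here's a start: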

text = "Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences."

import re
pattern = re.compile("favou?rite numbers?", re.I)

print([sentence for sentence in text.split('.') if pattern.search(sentence)])
# ['Our favorite numbers are 5, 6, and 7, but his favorite number is 0']

Now that you have the list of interesting sentences, you are one list comprehension and one re.findall(r'\d+', sentence) away from a complete solution.
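For reference, the remaining step could look roughly like the sketch below. The function name `favorite_numbers` is made up for illustration. Note the assumption built into this approach: every number in a matching sentence is collected, so for a sentence like "Alice has 2 favorite numbers: 11 and 12" the count 2 would be picked up as well.

```python
import re

text = ("Our favorite numbers are 5, 6, and 7, but his favorite number is 0. "
        "Also, this text contains 2 sentences.")

# Same pattern as above: matches favorite/favourite number(s), ignoring case.
pattern = re.compile(r"favou?rite numbers?", re.I)


def favorite_numbers(text):
    """Collect every run of digits from sentences that mention the phrase."""
    return [
        number
        for sentence in text.split('.')
        if pattern.search(sentence)
        for number in re.findall(r'\d+', sentence)
    ]


print(favorite_numbers(text))
# ['5', '6', '7', '0']
```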

Answer 2

You can use a regular expression:

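import re
import itertools
s = 'Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences.'

# Grab the run of digits, commas, whitespace (and "and") that follows each phrase.
numbers = re.findall(r'(?<= favorite number is)[,\s\d]+|(?<=favorite numbers are)[,\s\dand]+', s)
# Pull the individual numbers out of each run and flatten the result.
final_numbers = list(itertools.chain(*[re.findall(r'\d+', i) for i in numbers]))
print(final_numbers)

Output: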

['5', '6', '7', '0']
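The lookbehinds above only cover the "is" and "are" phrasings, so the second example in the question ("favorite numbers: 11 and 12") would not match. One possible variation, offered as a sketch rather than a definitive pattern, is to match the phrase followed by "is", "are", or a colon and capture the list that follows. The names `PHRASE_THEN_LIST` and `numbers_after_phrase` are hypothetical.

```python
import re

# Phrase, then "is" / "are" / ":", then a run made of numbers, "and",
# commas and whitespace. The run ends at the first other character.
PHRASE_THEN_LIST = re.compile(
    r"favou?rite numbers?\s*(?:is|are|:)\s*((?:\d+|and|[,\s])+)",
    re.I,
)


def numbers_after_phrase(text):
    """Return the numbers listed directly after each occurrence of the phrase."""
    result = []
    for run in PHRASE_THEN_LIST.findall(text):
        result.extend(re.findall(r"\d+", run))
    return result


s1 = ("Our favorite numbers are 5, 6, and 7, but his favorite number is 0. "
      "Also, this text contains 2 sentences.")
s2 = ("Alice has 2 favorite numbers: 11 and 12. Bob has 10 favorite numbers: "
      "0, 100, 1264, 598, 78496, 33546, 1028896, 23, 48, 6.")

print(numbers_after_phrase(s1))
print(numbers_after_phrase(s2))
```

Because only the numbers that come after the phrase are captured, the counts in "Alice has 2" and "Bob has 10" are left out.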

Answer 3

I'm on my phone, so I can't check my code right now; I'll test it when I get home.

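text = "Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences."
sentences = text.split('.')
numbers = set()
for sentence in sentences:
    if "favorite number" in sentence:
        # Collect every distinct character of the matching sentence.
        numbers = numbers.union(set(sentence))
# Drop everything that is not an ASCII digit (code points 48-57 are kept).
# Works character by character, so only single-digit numbers are handled,
# and the order of the result is arbitrary.
numbers = list(numbers.difference(set([*[chr(n) for n in range(32,48)],*[chr(n) for n in range(58,168)]])))
numbers = [int(x) for x in numbers]
print(numbers)

Another way could be: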

text = "Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences."
sentences = text.split('.')
numbers = []
for sentence in sentences:
    if "favorite number" in sentence:
        for character in sentence:
            try:
                # int() raises ValueError for anything that is not a digit.
                numbers.append(int(character))
            except ValueError:
                pass
print(numbers)

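Using timeit.timeit and running each function 100000 times (without the print()), the first way took 3.614777436797135 seconds and the second way 12.934136042429973 seconds. So the first one does not preserve order, but it is about 3.58 times faster.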
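A comparison like this could be reproduced along the following lines. This is only a rough sketch: the function names `with_sets` and `with_try_except` are hypothetical, `with_sets` is a simplified variant of the first snippet rather than the exact code, and the run count is lowered to keep it quick. Absolute timings will differ from machine to machine.

```python
import timeit

TEXT = ("Our favorite numbers are 5, 6, and 7, but his favorite number is 0. "
        "Also, this text contains 2 sentences.")


def with_sets(text=TEXT):
    """Set-based variant: distinct digit characters, arbitrary order."""
    chars = set()
    for sentence in text.split('.'):
        if "favorite number" in sentence:
            chars |= set(sentence)
    return [int(c) for c in chars if c in "0123456789"]


def with_try_except(text=TEXT):
    """try/except variant: digits in order of appearance."""
    numbers = []
    for sentence in text.split('.'):
        if "favorite number" in sentence:
            for character in sentence:
                try:
                    numbers.append(int(character))
                except ValueError:
                    pass
    return numbers


runs = 1000  # the answer used 100000; lowered here for a quick check
t_sets = timeit.timeit(with_sets, number=runs)
t_try = timeit.timeit(with_try_except, number=runs)
print("sets:       %.4f s" % t_sets)
print("try/except: %.4f s" % t_try)
```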

Answer 4

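This is a bit longer than the other answers, but a state-machine approach may turn out to be more maintainable if your requirements change.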

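import re

text = """
Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences.
"""

# A token is either a run of digits or a run of letters and spaces.
r = re.compile(r"(\d+|[a-zA-Z ]+)")

faves = False  # state: are we right after a "favorite number(s)" phrase?
lst = []

while True:
    s = r.search(text)
    if s is None:
        break
    x = s.group(1).strip()
    if x:
        if x == 'and':
            # "and" between numbers does not change the state
            pass
        elif re.search(r'favou?rite numbers?', x):
            faves = True
        elif re.match(r"^\d+$", x) and faves:
            lst.append(x)
        else:
            # any other text ends the current list of favorites
            faves = False
    # continue scanning after the token just consumed
    text = text[s.end():]

print(lst)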
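To try the same idea on the second example from the question, the loop can be wrapped in a function. The sketch below is one way to do that; the names `favorite_numbers`, `TOKEN` and `PHRASE` are made up for illustration, and it uses re.finditer instead of repeatedly slicing the text, which should visit the same tokens.

```python
import re

TOKEN = re.compile(r"\d+|[a-zA-Z ]+")        # a number, or a run of words
PHRASE = re.compile(r"favou?rite numbers?")  # switches collection on


def favorite_numbers(text):
    """Tiny state machine: collect numbers that follow the phrase."""
    collecting = False
    found = []
    for match in TOKEN.finditer(text):
        token = match.group().strip()
        if not token or token == "and":
            continue                  # separators keep the current state
        if PHRASE.search(token):
            collecting = True
        elif token.isdigit() and collecting:
            found.append(token)
        else:
            collecting = False        # any other text ends the list
    return found


example = ("Alice has 2 favorite numbers: 11 and 12. Bob has 10 favorite "
           "numbers: 0, 100, 1264, 598, 78496, 33546, 1028896, 23, 48, 6.")
print(favorite_numbers(example))
```

The counts in "Alice has 2" and "Bob has 10" are skipped because they appear before the phrase, while the state is still off.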

How to parse around a specific word/phrase
