How to parse specific words/phrases

Date: 2017-11-13 14:29:36

Tags: python

Suppose I have a text like this:

Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences.

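Let's say I only want to get the favorite numbers from this text. Unless the phrase favorite number is present, I have no way of knowing whether the text contains any favorite numbers at all. So I'm basically trying to parse the numbers around the phrase favorite number (or favorite numbers). The expected result should look like this: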

['5', '6', '7', '0']

I tried using regular expressions, but so far I have failed. What would be the most logical way to do this?

Edit: after reading @LouiseDavies's question, I'm adding another example below:

Alice has 2 favorite numbers: 11 and 12. Bob has 10 favorite numbers: 0, 100, 1264, 598, 78496, 33546, 1028896, 23, 48, 6.

So in this example, my output should look like this (order doesn't matter):

['11', '12', '0', '100', '1264', '598', '78496', '33546', '1028896', '23', '48', '6']

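4 answers:

Answer 0 (score: 1)

You didn't show any code, so I won't write a complete solution.

You can split on ., keep the sentences that contain "favourite number", and extract the numbers from those sentences. You shouldn't try to write one regex for the whole sentence.

Here's a start: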

text = "Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences."

import re
pattern = re.compile("favou?rite numbers?", re.I)

print([sentence for sentence in text.split('.') if pattern.search(sentence)])
# ['Our favorite numbers are 5, 6, and 7, but his favorite number is 0']

Now that you have the list of interesting sentences, you're one list comprehension and a re.findall(r'\d+', sentence) away from a complete solution.
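The remaining step can be sketched as follows. This is one possible completion rather than the answerer's own code; the names PHRASE and favorite_numbers are made up for illustration.

```python
import re

# Matches "favorite number(s)" in either spelling, case-insensitively.
PHRASE = re.compile(r"favou?rite numbers?", re.I)

def favorite_numbers(text):
    """Collect every run of digits from sentences that mention the phrase."""
    return [
        number
        for sentence in text.split('.')
        if PHRASE.search(sentence)
        for number in re.findall(r'\d+', sentence)
    ]

text = ("Our favorite numbers are 5, 6, and 7, but his favorite number is 0. "
        "Also, this text contains 2 sentences.")
print(favorite_numbers(text))  # ['5', '6', '7', '0']
```

Note that this keeps every number in a matching sentence, so on the second example from the question it would also return the counts 2 and 10.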

Answer 1 (score: 0)

You can use a regular expression:

import re
import itertools
s = 'Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences.'

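# Raw strings keep \s and \d from being treated as string escapes.
numbers = re.findall(r'(?<= favorite number is)[,\s\d]+|(?<=favorite numbers are)[,\s\dand]+', s)
final_numbers = list(itertools.chain.from_iterable(re.findall(r'\d+', i) for i in numbers))
print(final_numbers)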

Output:

['5', '6', '7', '0']
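The lookbehinds above depend on the exact wording "is" and "are", so they would not match the second example in the question, where the numbers follow a colon. A rough sketch of a looser pattern is shown here, assuming the numbers always come after the phrase and within the same sentence; LIST_AFTER_PHRASE and numbers_after_phrase are hypothetical names, not part of the original answer.

```python
import re

LIST_AFTER_PHRASE = re.compile(
    r"favou?rite numbers?\b"               # the trigger phrase
    r"[^\d.]*"                             # filler such as " are " or ": ", never past a period
    r"(\d+(?:\s*,?\s*(?:and\s+)?\d+)*)",   # a list like "5, 6, and 7" or "11 and 12"
    re.I,
)

def numbers_after_phrase(text):
    """Return the numbers listed right after each occurrence of the phrase."""
    found = []
    for match in LIST_AFTER_PHRASE.finditer(text):
        found.extend(re.findall(r"\d+", match.group(1)))
    return found

first = ("Our favorite numbers are 5, 6, and 7, but his favorite number is 0. "
         "Also, this text contains 2 sentences.")
second = ("Alice has 2 favorite numbers: 11 and 12. Bob has 10 favorite numbers: "
          "0, 100, 1264, 598, 78496, 33546, 1028896, 23, 48, 6.")
print(numbers_after_phrase(first))   # ['5', '6', '7', '0']
print(numbers_after_phrase(second))
```

Because the capture only starts after the phrase, the counts 2 and 10 that precede it in the second example are left out.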

Answer 2 (score: 0)

I'm on my phone, so I can't test my code; I'll check it when I get home.

text = "Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences."
sentences = text.split('.')
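numbers = set()
for sentence in sentences:
    if "favorite number" in sentence:
        # set(sentence) is a set of single characters, so this only handles one-digit numbers
        numbers = numbers.union(set(sentence))
# Remove space/punctuation (codes 32-47) and everything from ':' upward (codes 58-167), leaving the digits
numbers = list(numbers.difference(set([*[chr(n) for n in range(32,48)],*[chr(n) for n in range(58,168)]])))
numbers = [int(x) for x in numbers]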
print(numbers)

Another way could be:

text = "Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences."
sentences = text.split('.')
numbers = []
for sentence in sentences:
    if "favorite number" in sentence:
        for character in sentence:
            try:
                numbers.append(int(character))
            except ValueError:
                pass
print(numbers)

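Using timeit.timeit and running each function 100000 times (without print()), the first way took 3.614777436797135 seconds and the second took 12.934136042429973 seconds. So the first one does not preserve order, but it is roughly 3.57 times faster.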
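A comparison like this could be reproduced along the following lines. This is only a sketch, not the answerer's actual harness: the function names with_sets and with_loop are invented here, the set of non-digit characters is precomputed once, and the run count is lowered, so the absolute timings will differ from the figures quoted above.

```python
import timeit

TEXT = ("Our favorite numbers are 5, 6, and 7, but his favorite number is 0. "
        "Also, this text contains 2 sentences.")
# Characters to discard: codes 32-47 and 58-167, as in the first snippet.
NON_DIGITS = {chr(n) for n in range(32, 48)} | {chr(n) for n in range(58, 168)}

def with_sets():
    """First approach: set arithmetic on characters (order is not preserved)."""
    chars = set()
    for sentence in TEXT.split('.'):
        if "favorite number" in sentence:
            chars |= set(sentence)
    return [int(ch) for ch in chars - NON_DIGITS]

def with_loop():
    """Second approach: try int() on every character (order is preserved)."""
    digits = []
    for sentence in TEXT.split('.'):
        if "favorite number" in sentence:
            for ch in sentence:
                try:
                    digits.append(int(ch))
                except ValueError:
                    pass
    return digits

runs = 10000  # assumed value; the answer used 100000
print("sets:", timeit.timeit(with_sets, number=runs))
print("loop:", timeit.timeit(with_loop, number=runs))
```

Both variants look at single characters, so neither would handle multi-digit numbers such as those in the second example from the question.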

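Answer 3 (score: 0)

This is a bit longer than the other answers, but a state machine approach may turn out to be more maintainable if your requirements change.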

import re

text = """
Our favorite numbers are 5, 6, and 7, but his favorite number is 0. Also, this text contains 2 sentences.
"""

r = re.compile(r"(\d+|[a-zA-Z ]+)")

faves = False
lst = []

while True:
    s = r.search(text)
    if s is None:
        break
    x = s.group(1).strip()
    if x:
        if x == 'and':
            pass
        elif re.search(r'favou?rite numbers?', x):
            faves = True
        elif re.match(r"^\d+$", x) and faves:
            lst.append(x)
        else:
            faves = False
    text = text[s.end():]

print(lst)
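To try the same state machine on the second example from the question, the loop can be wrapped in a function. The version here is a sketch that iterates with finditer instead of slicing the text; TOKEN, PHRASE, and favorite_numbers are illustrative names rather than part of the answer above.

```python
import re

TOKEN = re.compile(r"\d+|[a-zA-Z ]+")          # a run of digits, or a run of letters/spaces
PHRASE = re.compile(r"favou?rite numbers?")

def favorite_numbers(text):
    """Collect digit tokens that follow the phrase until other words appear."""
    collecting = False
    found = []
    for match in TOKEN.finditer(text):
        token = match.group().strip()
        if not token or token == 'and':
            continue                 # separators leave the state unchanged
        if PHRASE.search(token):
            collecting = True        # start collecting after the phrase
        elif token.isdigit() and collecting:
            found.append(token)
        else:
            collecting = False       # any other word or stray number resets the state
    return found

second = ("Alice has 2 favorite numbers: 11 and 12. Bob has 10 favorite numbers: "
          "0, 100, 1264, 598, 78496, 33546, 1028896, 23, 48, 6.")
print(favorite_numbers(second))
```

On this input the counts 2 and 10 are skipped because they appear before the phrase, while the state is still off.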