Regex - Python - capture everything between words

Time: 2016-02-28 15:38:44

Tags: regex python-2.7 keyword findall

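Is it possible to capture only the sentences that contain a keyword (a time)? For example:

'I want to capture this part (time) and this part. Not this sentence, because it does not contain our keyword. But this sentence too, because it contains (time).'

- Note 1: the time is not in parentheses; it stands for a clock time such as 12:45, 10:45, etc.

- Note 2: I am looking for a regex that captures the whole sentence whenever the keyword is present. If findall does not find the keyword in a sentence, it should move on to the next sentence.

- Note 3: in the end we should have the total set of sentences that contain the keyword.

I have added some more information: a test of the code you provided, and the text.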

text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

capture_1 = re.findall(r"(?:\.|\A)(.*\d*:\d*.*)\.", text, flags=re.DOTALL)
capture_2 = re.findall(r'(\..*)(\d*:\d*)(.*) ', text, flags=re.DOTALL)

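capture_1 gives me this:

['He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14']

capture_2 gives me this:

[('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00', ':14', '. The police found his body 10 minutes after the')]

I want the following sentences: [('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14')]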

3 Answers:

Answer 0: (score: 1)

UPDATE 2: I just came up with a pattern (the original post linked a demo). Hope it helps:

(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])

Explanation:

(?:^|\s+)       non-capturing group:
                matches the start of the string, or one or more whitespace characters
(               capturing group starts
[^.!?]*         zero or more characters other than . ! or ?
(?:\d\d:\d\d)   non-capturing group:
                matches a time in dd:dd format
[^.!?]*         zero or more characters other than . ! or ?
[.!?]           the sentence ends with . ! or ?
)               capturing group ends

import re
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
print(' '.join(re.findall(r'(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])', text)))

Output:

The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14.
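
Since the question also asks for the total of matching sentences (Note 3), here is a minimal sketch that keeps the result of findall as a list instead of joining it. The names TIME_SENTENCE and sentences are illustrative and not part of the answer above.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of "
        "the terrace. He did not survived. Time of death was 00:14. The "
        "police found his body 10 minutes after the explosion")

# Same pattern as in the answer, compiled once (the name is illustrative).
TIME_SENTENCE = re.compile(r'(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])')

# findall returns one string per match because the pattern has a single
# capturing group.
sentences = TIME_SENTENCE.findall(text)

for s in sentences:
    print(s)
print(len(sentences))  # how many sentences contain a dd:dd time
```

Keeping the list around makes it easy to count, filter, or post-process the sentences before joining them for display.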

Answer 1: (score: 0)

(?:\.|\A)([^.]*\d*:\d*[^.]*)\.

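It captures everything between two periods, or between the start of the string and a period (so the first sentence can be captured too). If your string contains newlines, you need the re.DOTALL flag so that . also matches newlines; this only matters for patterns that use a bare ., since [^.] already matches newlines.

For example: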

re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text, flags=re.DOTALL)

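Note that this returns the sentences containing your keyword in a single call, so there is no need to go through the text sentence by sentence. One caveat: because each match consumes its closing period, a matching sentence that directly follows another match is skipped; on the sample text the 23:58 sentence is missed.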
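
A minimal sketch of one way to keep adjacent sentences, assuming the closing period is checked with a lookahead (?=\.) rather than consumed, and assuming \d+:\d+ so that a bare colon does not count as a time. Both changes are adjustments made for this illustration and are not part of the answer's pattern; the original pattern is run alongside so the difference can be checked.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of "
        "the terrace. He did not survived. Time of death was 00:14. The "
        "police found his body 10 minutes after the explosion")

# Pattern from the answer: the trailing \. is consumed by each match, so the
# next sentence has no leading period left to start from.
consuming = re.findall(r'(?:\.|\A)([^.]*\d*:\d*[^.]*)\.', text)

# Variant (an assumption, not the answer's pattern): the lookahead only
# checks for the period and leaves it in place for the next match.
lookahead = re.findall(r'(?:\.|\A)([^.]*\d+:\d+[^.]*)(?=\.)', text)

print(len(consuming))
print(len(lookahead))
for s in lookahead:
    print(s.strip())  # strip the space that follows the previous period
```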

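Edit:

I have changed the regex above so that it captures every sentence containing the keyword EXCEPT when the keyword is directly next to a period. If I may, I would suggest another technique that uses a list comprehension: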

[s for s in re.split(r'\.', text) if re.search(r'\d*:\d*', s)]

For your example this returns:

[' The terrorist destroyed the building at 23:45 with a remote detonation device',
 ' He escaped at 23:58 from the balcony of the terrace',
 ' Time of death was 00:14']

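Note that if your text contains a period that does not end a sentence, this will still run into problems. For example, "Mr. Magoo ate beans and toast at 12:34" would be captured as " Magoo ate beans and toast at 12:34", and the "Mr" would be lost.

If you run into that problem, I would suggest asking it as a separate question.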
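
As a rough illustration of that limitation, the sketch below splits only on periods that are not preceded by a known abbreviation. The abbreviation list (Mr, Mrs, Dr), the name SENTENCE_END, and the sample sentence are assumptions made for this example; real text would need a longer list or a proper sentence tokenizer.

```python
import re

sample = "Mr. Magoo ate beans and toast at 12:34. He left at once."

# Naive split: the period in "Mr." is treated as the end of a sentence.
naive = [s for s in re.split(r'\.', sample) if re.search(r'\d+:\d+', s)]

# Skip periods preceded by a few known abbreviations (hypothetical list).
# Each negative lookbehind has a fixed width, as Python's re module requires.
SENTENCE_END = re.compile(r'(?<!Mr)(?<!Mrs)(?<!Dr)\.\s*')
aware = [s for s in SENTENCE_END.split(sample) if re.search(r'\d+:\d+', s)]

print(naive)
print(aware)
```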

Answer 2: (score: 0)

Well, you can achieve this easily with a regex (positive lookbehind and lookahead).

Here is an example that uses such a regex.

import re


def replace_keyword(start, end, data):
    # An empty start or end anchors the match at the beginning or end of the string.
    # Keywords are escaped so that they are matched literally.
    start_rx = re.escape(start) if start else "^"
    end_rx = re.escape(end) if end else "$"

    rx = "(?<={0}).*(?={1})".format(start_rx, end_rx)
    match = re.search(rx, data, re.DOTALL | re.MULTILINE)
    if match:
        # The lookahead does not consume the end keyword, so add it back.
        return match.group() + end
    else:
        return data


data = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

# An empty start string means: search from the beginning of the string.
start = ""

# Stop at this keyword; an empty end string would mean: search until the end of the string.
end = "00:14"

data = replace_keyword(start, end, data)

print(data)

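After running the code above, data will contain the text

  He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14

Hope it does what you expect.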