Question

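Is it possible to capture specific sentences that contain a keyword (a time)? For example:

"I want to capture this part (time) and this part. Not this sentence, because it doesn't contain our keyword. But also this sentence, because it contains (time)."

- Note 1: the time is not in parentheses; it stands for a clock time, for example 12:45, 10:45, etc.

- Note 2: I am looking for a regex that captures every sentence in which the keyword is present. If findall does not find the keyword in a sentence, it should move on to the next sentence.

- Note 3: in the end we have the full set of sentences that contain the specific keyword.

I've added some more information: the text and the code I tested.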

import re

text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

capture_1 = re.findall(r"(?:\.|\A)(.*\d*:\d*.*)\.", text, flags=re.DOTALL)
capture_2 = re.findall(r'(\..*)(\d*:\d*)(.*) ', text, flags=re.DOTALL)

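capture_1 gave me this:

['He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14']

capture_2 gave me this:

[('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00', ':14', '. The police found his body 10 minutes after the')]

I would like to get the following sentences: ['. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14']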

Answer 1

UPDATE 2: I just came up with a pattern. Hope it helps:

(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])

Explanation:

(?:^|\s+)       Non-capturing group,
                match start of string, or 1 or more whitespace characters
(               capturing group starts
[^.!?]*         0 or more times of characters except . ! or ?
(?:\d\d:\d\d)   Non-capturing group,
                match dd:dd time format
[^.!?]*         0 or more times of characters except . ! or ?
[.!?]           sentence ends with . ! or ?
)               capturing group ends

import re
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
print(' '.join(re.findall(r'(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])', text)))

Output:

The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14.
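
As a minimal sketch of how this pattern could be packaged for reuse, the snippet below compiles it with re.VERBOSE so the explanation lives next to the pattern, and also reports how many sentences matched (the total mentioned in Note 3 of the question). The names SENTENCE_WITH_TIME and sentences_with_time are illustrative, not part of the original answer.

```python
import re

# Same pattern as above, written in verbose mode so each part can be commented.
SENTENCE_WITH_TIME = re.compile(r"""
    (?:^|\s+)      # start of string, or whitespace after the previous sentence
    (              # capture the sentence itself
      [^.!?]*      # text before the time, never crossing a sentence terminator
      \d\d:\d\d    # the keyword: a dd:dd time
      [^.!?]*      # text after the time
      [.!?]        # the sentence terminator
    )
""", re.VERBOSE)


def sentences_with_time(text):
    """Return the sentences in `text` that contain a dd:dd time."""
    return SENTENCE_WITH_TIME.findall(text)


text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of the "
        "terrace. He did not survived. Time of death was 00:14. The police "
        "found his body 10 minutes after the explosion")

matches = sentences_with_time(text)
print(len(matches))  # number of sentences containing a time
for sentence in matches:
    print(sentence)
```

Compiling once is only a convenience when the same pattern is applied to many texts; a direct re.findall call behaves the same way.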

Answer 2

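(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)

This captures everything between two periods, or between the start of the string and a period (so the first sentence can be captured too). The closing period is only asserted with the lookahead (?=\.) rather than consumed, so it can also serve as the opening period of the next sentence; if it were consumed, a matching sentence that directly follows another match would be skipped. Because the pattern uses [^.] rather than ., it already matches across newlines and does not need the re.DOTALL flag.

For example:

re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)", text)

Note that this returns all the sentences containing your keyword in a single call, so there is no need to go through the text sentence by sentence.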
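
To see why the handling of the closing period matters for a pattern of this shape, the sketch below compares a variant that consumes the final \. with one that only asserts it via (?=\.). The variable names consuming and lookahead are illustrative.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of the "
        "terrace. He did not survived. Time of death was 00:14. The police "
        "found his body 10 minutes after the explosion")

# Variant 1: the closing period is consumed, so it cannot also open
# the next sentence. Back-to-back matching sentences lose the second one.
consuming = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text)

# Variant 2: the closing period is only checked with a lookahead,
# so the next match can start at that same period.
lookahead = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)", text)

print(len(consuming))  # 2: the "He escaped at 23:58 ..." sentence is skipped
print(len(lookahead))  # 3
```

Each captured sentence keeps its leading space, so a .strip() may be wanted before further processing.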

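Edit:

I have changed the regex above so that it captures every sentence containing the keyword, EXCEPT when the keyword is immediately adjacent to a period. If I may, I would suggest another technique using a list comprehension: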

[s for s in re.split(r'\.', text) if re.search(r'\d*:\d*', s)]

which, for your example, returns:

[' The terrorist destroyed the building at 23:45 with a remote detonation device',
 ' He escaped at 23:58 from the balcony of the terrace',
 ' Time of death was 00:14']

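Note that if your text contains a period that does not end a sentence, this will still run into trouble. For example, "Mr. Magoo ate beans and toast at 12:34" would be captured as " Magoo ate beans and toast at 12:34", missing the "Mr".

If you run into that problem, I would suggest asking it as a separate question.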
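
For what it's worth, one rough way to soften the abbreviation problem is to split only on whitespace that follows a sentence terminator, while refusing to split right after a small list of known abbreviations. This is a sketch under assumptions: the abbreviation list (Mr., Mrs., Dr.) and the names SPLIT_RX, TIME_RX and sentences_with_time are made up for illustration, and real text will need a longer list or a proper sentence tokenizer.

```python
import re

# Split on whitespace that follows . ! or ?, unless that period belongs to
# one of a few hard-coded abbreviations (hypothetical, deliberately short list).
SPLIT_RX = re.compile(r"(?<!\bMr\.)(?<!\bMrs\.)(?<!\bDr\.)(?<=[.!?])\s+")
TIME_RX = re.compile(r"\d\d:\d\d")


def sentences_with_time(text):
    """Return the sentences of `text` that contain a dd:dd time."""
    return [s for s in SPLIT_RX.split(text) if TIME_RX.search(s)]


sample = "Mr. Magoo ate beans and toast at 12:34. He left soon after."
print(sentences_with_time(sample))
# ['Mr. Magoo ate beans and toast at 12:34.']
```

Each abbreviation gets its own negative lookbehind because Python's re module requires lookbehinds to have a fixed width.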

Answer 3

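Well, you can easily achieve this with a regular expression that uses a positive lookbehind and a positive lookahead.

Below is an example that uses such a regular expression.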

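import re


def replace_keyword(start, end, data):
    # An empty start/end anchors the search at the beginning/end of the string;
    # otherwise the keyword is escaped so it is matched literally.
    start_rx = re.escape(start) if start else "^"
    end_rx = re.escape(end) if end else "$"

    # Lookbehind/lookahead keep the delimiters themselves out of the match.
    rx = "(?<={0}).*(?={1})".format(start_rx, end_rx)
    match = re.search(rx, data, re.DOTALL | re.MULTILINE)
    if match:
        # Re-attach the literal end keyword (nothing is added when end is empty).
        return match.group() + end
    else:
        return data


data = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

# An empty start string means: start searching from the beginning of the string.
start = ""

# An empty end string would mean: search until the end of the string.
end = "00:14"

data = replace_keyword(start, end, data)

print(data)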

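After running the code above, data will contain the text

He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14

Hopefully this does what you are expecting.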
