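Is it possible to capture specific sentences that contain a keyword (a time)? For example:
"I want to capture this part (time) and this part. Not this sentence, because it does not contain our keyword. But this sentence too, because it contains (time)."
- Note 1: (time) is not literally in parentheses; it stands for a time value, e.g. 12:45, 10:45, etc.
- Note 2: I am looking for a regular expression that captures every sentence in which the keyword is present. If findall does not find the keyword in a sentence, it should move on to the next sentence.
- Note 3: In the end we should have the full set of sentences that contain the keyword.
I have added some more information below: the text and the code I tested.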
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
capture_1 = re.findall(r"(?:\.|\A)(.*\d*:\d*.*)\.", text, flags=re.DOTALL)
capture_2 = re.findall(r'(\..*)(\d*:\d*)(.*) ', text, flags=re.DOTALL)
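capture_1 gives me this:
['He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14']
capture_2 gives me this:
[('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00', ':14', '. The police found his body 10 minutes after the')]
I want the following sentences: [('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14')]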
Answer 0 (score: 1)
UPDATE 2: I just came up with a pattern. I hope it helps:
(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])
Explanation:
(?:^|\s+)        Non-capturing group:
                 match the start of the string, or one or more whitespace characters
(                capturing group starts
[^.!?]*          zero or more characters other than . ! or ?
(?:\d\d:\d\d)    Non-capturing group:
                 match a time in dd:dd format
[^.!?]*          zero or more characters other than . ! or ?
[.!?]            the sentence ends with . ! or ?
)                capturing group ends
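The breakdown above can also be kept inside the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of that idea; the variable names and the sample string are made up for illustration.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part carries its comment.
pattern = re.compile(r"""
    (?:^|\s+)        # start of the string, or one or more whitespace characters
    (                # capturing group starts
      [^.!?]*        # zero or more characters other than . ! ?
      \d\d:\d\d      # a time in dd:dd format
      [^.!?]*        # zero or more characters other than . ! ?
      [.!?]          # the sentence terminator
    )                # capturing group ends
""", re.VERBOSE)

# Hypothetical sample text, only the middle sentence contains a time.
sample = "Nothing here. Lunch is at 12:45. Dinner is later!"
print(pattern.findall(sample))
```

Only the sentence containing a time is returned, without the whitespace that precedes it.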
import re
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
print(' '.join(re.findall(r'(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])', text)))
Output:
The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14.
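Since note 3 of the question asks for the collection of matching sentences, the same pattern could be wrapped in a small helper that returns them as a list, so they can be counted as well as joined. This is a sketch, not part of the original answer: the names TIME_SENTENCE and sentences_with_time are invented here, and it assumes every time is written as two digits, a colon, and two digits.

```python
import re

# Hypothetical names; the pattern itself is the one from this answer.
TIME_SENTENCE = re.compile(r'(?:^|\s+)([^.!?]*\d\d:\d\d[^.!?]*[.!?])')

def sentences_with_time(text):
    """Return every sentence that contains a dd:dd time."""
    return TIME_SENTENCE.findall(text)

text = ("He was there. The terrorist destroyed the building at 23:45 "
        "with a remote detonation device. He escaped at 23:58 from the "
        "balcony of the terrace. He did not survived. Time of death was "
        "00:14. The police found his body 10 minutes after the explosion")

matches = sentences_with_time(text)
print(len(matches))        # how many sentences contain a time
print(' '.join(matches))   # the sentences themselves
```

Because the pattern requires a closing . ! or ?, a final sentence that contains a time but has no terminating punctuation would not be returned.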
Answer 1 (score: 0)
(?:\.|\A)([^.]*\d*:\d*[^.]*)\.
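It captures everything between two periods, or between the start of the string and a period (so the first sentence can be captured too). If your pattern uses . and your string contains newlines, you need the re.DOTALL flag so that . also matches newline characters; the negated class [^.] used here already matches them.
For example:
re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text, flags=re.DOTALL)
Note that a single findall call collects the matching sentences, so there is no need to loop over the text sentence by sentence.
I have changed the regex above so that it captures every sentence containing the keyword EXCEPT when the keyword is immediately next to a period.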
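One thing to watch for with this pattern: each match consumes its closing period, so that period is no longer available as the opening \. of the next sentence, and two consecutive sentences that both contain a time are not both returned. A possible workaround, sketched below and not part of the original answer, is to replace the opening \. with the lookbehind (?<=\.), which checks for the period without consuming it. The sketch also uses \d+:\d+ instead of \d*:\d* so that a bare colon does not count as a time; both changes are assumptions about what is wanted.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 "
        "with a remote detonation device. He escaped at 23:58 from the "
        "balcony of the terrace. He did not survived. Time of death was "
        "00:14. The police found his body 10 minutes after the explosion")

# Pattern from the answer: the leading \. needs a period that the
# previous match has already consumed.
original = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text)

# Variant (an assumption, not the answer's pattern): look behind for the
# period instead of consuming it, and require digits on both sides of the colon.
variant = re.findall(r"(?:(?<=\.)|\A)([^.]*\d+:\d+[^.]*)\.", text)

print(len(original))                  # the 23:58 sentence is skipped
print([s.strip() for s in variant])   # all three sentences, whitespace trimmed
```

The captured sentences keep their leading space and lose the closing period, hence the strip() in the last line.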
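If I may suggest another technique, using a list comprehension:
[s for s in re.split(r'\.', text) if re.search(r'\d*:\d*', s)]
which, for your example, returns:
[' The terrorist destroyed the building at 23:45 with a remote detonation device',
 ' He escaped at 23:58 from the balcony of the terrace',
 ' Time of death was 00:14']
Note that this will still run into problems if your text contains a period that does not end a sentence. For example, "Mr. Magoo ate beans and toast at 12:34" would capture " Magoo ate beans and toast at 12:34" and miss the "Mr".
If you run into that problem, I would suggest asking about it as a separate question.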
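For what it is worth, one rough way to soften the abbreviation problem is to refuse to split on a period that directly follows a known abbreviation, using negative lookbehinds. This is only a sketch under strong assumptions: the abbreviation list is hypothetical and deliberately tiny, and real text would need a proper sentence tokenizer.

```python
import re

# Hypothetical, deliberately short abbreviation list; extend it for real text.
ABBREVIATIONS = ("Mr", "Mrs", "Dr")

# Split on a period unless one of the abbreviations comes right before it.
guards = "".join(r"(?<!\b{0})".format(a) for a in ABBREVIATIONS)
splitter = re.compile(guards + r"\.")

sample = "Mr. Magoo ate beans and toast at 12:34. He left."
parts = [s.strip() for s in splitter.split(sample) if re.search(r"\d+:\d+", s)]
print(parts)
```

Here the period after "Mr" is left alone, so the whole sentence is kept as one piece.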
Answer 2 (score: 0)
Well, you can easily achieve this with a regular expression (positive lookbehind and lookahead).
Here is an example that uses such a regex.
import re

def replace_keyword(start, end, data):
    # An empty start anchors at the beginning of the string,
    # an empty end anchors at the end of the string.
    # Non-empty keywords are escaped so they are matched literally.
    start_rx = re.escape(start) if start else "^"
    end_rx = re.escape(end) if end else "$"
    rx = "(?<={0}).*(?={1})".format(start_rx, end_rx)
    match = re.search(rx, data, re.DOTALL | re.MULTILINE)
    if match:
        # The lookahead does not consume the end keyword, so append it again.
        return match.group() + end
    else:
        return data

data = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
# An empty start string means: start searching from the beginning of the string.
start = ""
# An empty end string would mean: search until the end of the string.
end = "00:14"
data = replace_keyword(start, end, data)
print(data)
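After running the code above, data will contain the text:
He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14
I hope it does what you expect.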