Python Regex - extract multiple sentences containing the same keyword

Time: 2018-07-11 19:43:14

Tags: python regex

import re

regex = r"[^.?!-]*(?<=[.?\s!-])\b(pfs)\b(?=[\s.?!-])[^.?!-]*[.?!-]"

test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information"

subst = ""

result = re.sub(regex, subst, test_str, 0, re.IGNORECASE | re.MULTILINE)

if result:
    print (result)

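As we can see, test_str has two sentences that contain the keyword "pfs". However, the Python code above only extracts the second one, 'pfs of $ 950 filed to driver'. How can I modify it so that it also extracts 'pfs alert conf'?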
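Note that re.sub with an empty replacement deletes the matched text rather than returning it. To see what the pattern itself matches, a minimal sketch with re.finditer can help. The variable name matches and the .strip() call are illustrative choices, not part of the original code.

```python
import re

# The pattern and input from the question, unchanged.
regex = r"[^.?!-]*(?<=[.?\s!-])\b(pfs)\b(?=[\s.?!-])[^.?!-]*[.?!-]"
test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information"

# finditer yields match objects; group(0) is the whole matched span,
# including the closing delimiter. strip() only trims surrounding spaces.
matches = [m.group(0).strip()
           for m in re.finditer(regex, test_str, re.IGNORECASE | re.MULTILINE)]
print(matches)  # ['pfs of $ 950 filed to driver -']
```

Only the second sentence is listed, which confirms that the first "pfs" is never matched.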

2 answers:

Answer 0: (score: 0)

Consider using nltk instead; imo it is really a better fit here:

from nltk import sent_tokenize

test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information. some junky words thereafter"
sentences = [sent for sent in sent_tokenize(test_str) if "pfs" in sent]
print(sentences)

This yields (note that the last sentence, which does not contain pfs, is left out):

['pfs alert conf .', 
 'it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information.']

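Answer 1: (score: 0)

The first pfs is at the start of the line, but the positive lookbehind (?<=[.?\s!-]) requires one character to be present before it, so it cannot succeed there. You can use an alternation that asserts either the start of the line (^) or the lookbehind.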

[^.?!-]*(?<=[.?\s!-])
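One way to write that alternation is sketched below. This is not necessarily the answerer's exact pattern: the group (?:^|(?<=[.?\s!-])) is an assumption, chosen because Python's re module requires fixed-width lookbehinds, so the ^ alternative is kept outside the lookbehind. The variable name sentences and the .strip() call are also illustrative.

```python
import re

# Assumed form of the fix: "start of line OR the original lookbehind".
regex = r"[^.?!-]*(?:^|(?<=[.?\s!-]))\b(pfs)\b(?=[\s.?!-])[^.?!-]*[.?!-]"
test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information"

# group(0) is the full sentence, including its closing delimiter.
sentences = [m.group(0).strip()
             for m in re.finditer(regex, test_str, re.IGNORECASE | re.MULTILINE)]
print(sentences)  # ['pfs alert conf .', 'pfs of $ 950 filed to driver -']
```

re.finditer is used instead of re.findall because the pattern has a capturing group, and findall would return only the captured "pfs" rather than the whole sentence.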
