import re
regex = r"[^.?!-]*(?<=[.?\s!-])\b(pfs)\b(?=[\s.?!-])[^.?!-]*[.?!-]"
test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information"
subst = ""
result = re.sub(regex, subst, test_str, count=0, flags=re.IGNORECASE | re.MULTILINE)
if result:
    print(result)
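As we can see, test_str has two sentences that contain the keyword "pfs". However, the Python code above only picks up the second one, 'pfs of $ 950 filed to driver'. How can I modify it so that it also extracts 'pfs alert conf'?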
Answer 0 (score: 0)
Consider using nltk instead; imo it is really a better fit here:
from nltk import sent_tokenize
test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information. some junky words thereafter"
sentences = [sent for sent in sent_tokenize(test_str) if "pfs" in sent]
print(sentences)
This yields (note that the last sentence, which does not contain pfs, is missing):
['pfs alert conf .',
'it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information.']
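Answer 1 (score: 0)
The first pfs is at the start of the line, but the positive lookbehind (?<=[.?\s!-]) requires 1 character to its left, and there is none at position 0. You could use an alternation to assert either the start of the line, ^, or
[^.?!-]*(?<=[.?\s!-])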
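One way the alternation might be spelled out is sketched below. The combined pattern is an assumption assembled from the two pieces named above (^ and the original prefix), wrapped in a non-capturing group; it is not necessarily the exact pattern the answer had in mind. The sketch also uses re.finditer rather than re.sub, on the assumption that the goal is to collect the matching sentences rather than delete them.

```python
import re

# Assumed combined pattern: try "^" first, otherwise fall back to the
# original prefix with its one-character lookbehind.
pattern = re.compile(
    r"(?:^|[^.?!-]*(?<=[.?\s!-]))\b(pfs)\b(?=[\s.?!-])[^.?!-]*[.?!-]",
    flags=re.IGNORECASE | re.MULTILINE,
)

test_str = (
    "pfs alert conf . it is unlikely that we will sign it "
    "- pfs of $ 950 filed to driver - we are gathering information"
)

# group(0) is the whole matched sentence; strip() drops the leading space
# that the prefix picks up after a delimiter.
sentences = [m.group(0).strip() for m in pattern.finditer(test_str)]
print(sentences)
```

On the test string from the question this prints ['pfs alert conf .', 'pfs of $ 950 filed to driver -'], so both sentences containing pfs are found, each with its trailing delimiter.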