I am using the following code to count the number of occurrences of phrases in a document:
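import re

# doc is assumed to hold the document text as a string
list_of_phrases = {}  # phrase -> running count, created once before any document is processed
phrases = ['yellow bananas']
# Keep only words (including hyphenated ones), joined by single spaces
clean_text = " ".join(re.findall(r'\w+(?:-\w+)*', doc))
for phrase in phrases:
    if phrase in clean_text:
        if phrase not in list_of_phrases:
            list_of_phrases[phrase] = clean_text.count(phrase)
        else:
            list_of_phrases[phrase] += clean_text.count(phrase)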
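My question: instead of getting the whole sentence, is there a way to get one, two, three, etc. words before/after the keyword being searched for?
Edit:
Sample document: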
Yellow bananas are nice. I like fruits. Nobody knows how many fruits there are out there. There are yellow bananas and many other fruits. Bananas, apples, oranges, mangos.
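The output would be counts of the phrases that contain the keyword, e.g. in this case "yellow bananas" with 1, 2, 3, etc. words before and after the keyword.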
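To make the desired output concrete, here is a rough sketch of one way it could be produced. The function name context_counts and the max_words parameter are hypothetical, made up for illustration. The sketch also assumes matching should be case-insensitive (so "Yellow bananas" counts) and that a window may be cut short at the start or end of the document.

```python
import re
from collections import Counter

def context_counts(doc, phrase, max_words=3):
    """Count `phrase` together with 1..max_words words on each side."""
    # Same tokenization as in the question, lowercased so that
    # "Yellow" matches "yellow" (an assumption, not in the original code).
    tokens = re.findall(r'\w+(?:-\w+)*', doc.lower())
    target = phrase.lower().split()
    n = len(target)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != target:
            continue
        # For every window size k, take k words before and k words after,
        # clipped to the bounds of the document.
        for k in range(1, max_words + 1):
            start = max(0, i - k)
            end = min(len(tokens), i + n + k)
            counts[" ".join(tokens[start:end])] += 1
    return counts
```

Called as context_counts(doc, 'yellow bananas'), this returns a Counter keyed by the widened phrases, such as 'are yellow bananas and'. Because punctuation is dropped during tokenization, the windows run across sentence boundaries; splitting the text into sentences first would be needed if that is not wanted.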