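I'm trying to split text into sentences wherever terminal punctuation ('.', '!', '?') appears. For example, given the following text:
Recognizing the rising opportunity Jerusalem Venture Partners opened up their Cyber Labs incubator, giving a home to many of the city’s promising young companies. International corporates like EMC have also established major centers in the park, leading the way for others to follow! On a visit last June, the park had already grown to two buildings with the ground being broken for the construction of more in the near future. This is really interesting! What do you think?
This should be split into 5 sentences (each one ends at a word followed by terminal punctuation).
Here is my code: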
# split on: '.+'
splitted_article_content = []
# article_content contains all the article's paragraphs
for element in article_content:
    splitted_article_content = splitted_article_content + re.split(".(?='.'+)", element)
# split on: '?+'
splitted_article_content_2 = []
for element in splitted_article_content:
    splitted_article_content_2 = splitted_article_content_2 + re.split(".(?='?'+)", element)
# split on: '!+'
splitted_article_content_3 = []
for element in splitted_article_content_2:
    splitted_article_content_3 = splitted_article_content_3 + re.split(".(?='!'+)", element)
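My question is: is there a more efficient way to do this without using any external libraries?
Thanks for your help.
Answer 0 (score: 1)
I think this calls for a lookbehind rather than a lookahead: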
import re
# article_content contains all the article's paragraphs
# in this case, a single paragraph.
article_content = ["""Recognizing the rising opportunity Jerusalem Venture Partners opened up their Cyber Labs incubator, giving a home to many of the city’s promising young companies. International corporates like EMC have also established major centers in the park, leading the way for others to follow! On a visit last June, the park had already grown to two buildings with the ground being broken for the construction of more in the near future. This is really interesting! What do you think?"""]
split_article_content = []
for element in article_content:
    # raw string so that \s reaches the regex engine unchanged
    split_article_content += re.split(r"(?<=[.!?])\s+", element)
print(*split_article_content, sep='\n\n')
Output
% python3 test.py
Recognizing the rising opportunity Jerusalem Venture Partners opened up their Cyber Labs incubator, giving a home to many of the city’s promising young companies.
International corporates like EMC have also established major centers in the park, leading the way for others to follow!
On a visit last June, the park had already grown to two buildings with the ground being broken for the construction of more in the near future.
This is really interesting!
What do you think?
%
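If you need this in more than one place, the same lookbehind can be wrapped in a small helper. This is only a sketch using the standard `re` module: the names `SENTENCE_BOUNDARY` and `split_sentences` are made up for illustration, and the second sample paragraph is invented to show how runs of punctuation behave. Because the lookbehind only checks the single character before the whitespace, runs such as `?!` or `...` stay attached to their sentence.

```python
import re

# Zero-width lookbehind: split on whitespace that follows '.', '!' or '?'.
# The punctuation itself is not consumed, so it stays with its sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraphs):
    """Return a flat list of sentences from an iterable of paragraph strings."""
    sentences = []
    for paragraph in paragraphs:
        # strip() plus the `if s` filter drops empty pieces from blank
        # or whitespace-padded paragraphs
        pieces = SENTENCE_BOUNDARY.split(paragraph.strip())
        sentences.extend(s for s in pieces if s)
    return sentences


# Hypothetical sample input: two paragraphs, the second with punctuation runs.
sample = [
    "This is really interesting! What do you think?",
    "Really?!  Yes... I think so.",
]
result = split_sentences(sample)
print(*result, sep="\n")
```

One caveat with this approach: it treats every period followed by whitespace as a boundary, so abbreviations such as "Dr. Smith" or "e.g. this" would also be split. Handling those needs extra rules that a single pattern like this does not cover.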