I want to create a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence, but not on decimals, abbreviations, titles in names, or when the sentence contains something like .com. This is my attempt at a regex that doesn't work.
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
    print(stuff)
Sample output of what it should look like:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
Answer 0 (score: 27)
OK, so I've looked at sentence tokenizers in some detail, using regexes, NLTK, and CoreNLP. You end up writing your own, and it depends on the application. This stuff is tricky and valuable, and people don't just give their tokenizer code away. (Ultimately, tokenization isn't a deterministic procedure; it's probabilistic, and it also depends very heavily on your corpus or domain, e.g. social-media posts versus Yelp reviews...)
In general you can't rely on one single Great White infallible regex; you have to write a function that uses several regexes (both positive and negative), plus a dictionary of abbreviations and some basic language parsing that knows, for example, that 'I', 'USA', 'FCC', and 'TARP' are capitalized in English.
To illustrate how easily this gets seriously complicated, let's try to write a functional spec for a deterministic tokenizer, just to decide whether a single or multiple period ('.'/'...') indicates end-of-sentence, or something else:
function isEndOfSentence(leftContext, rightContext)
In the simple (deterministic) case, this function would return a boolean, but in the more general sense it is probabilistic: it returns a float between 0.0 and 1.0 (the confidence level that that particular '.' is a sentence end).
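For illustration, here is a minimal sketch of what the deterministic variant might look like; the abbreviation set and the heuristics below are assumptions made for the example, not a production tokenizer:

# Toy sketch only: a deterministic stand-in for isEndOfSentence.
# The abbreviation set and heuristics are illustrative assumptions.
ABBREVIATIONS = {'mr', 'mrs', 'ms', 'dr', 'jr', 'sr', 'i.e', 'e.g', 'etc'}

def is_end_of_sentence(left_context: str, right_context: str) -> bool:
    """Decide whether the '.' between the two contexts ends a sentence."""
    tokens = left_context.rstrip().split()
    last_token = tokens[-1].lower().rstrip('.') if tokens else ''
    # A known abbreviation right before the period rarely ends a sentence.
    if last_token in ABBREVIATIONS:
        return False
    # Digits on both sides suggest a decimal number like 1.5.
    if left_context[-1:].isdigit() and right_context[:1].isdigit():
        return False
    # A capitalized word (or end of text) after the period suggests a new sentence.
    following = right_context.lstrip()
    return not following or following[0].isupper()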
References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]
Answer 1 (score: 27)
Answer 2 (score: 4)
Try to split the input on the spaces rather than on a dot or ?; if you do it this way, the dot or ? won't be printed in the final result. The lookbehind (?<=[^A-Z].[.?]) requires the space to follow a . or ? that is not preceded by a short capitalized token like Mr., and the lookahead (?=[A-Z]) requires the next sentence to start with a capital letter.
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print(i)
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
Answer 3 (score: 2)
sent = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
for s in sent:
    print(s)
The regex used here is: (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)

First block: (?<!\w\.\w.) — a negative lookbehind (?<!) that rejects a split point preceded by a word character (\w), a fullstop (\.), another word character, and one more character; this protects abbreviations like i.e.

Second block: (?<![A-Z][a-z]\.) — a negative lookbehind that rejects a split point preceded by an uppercase letter ([A-Z]), a lowercase letter ([a-z]), and a dot (\.); this protects titles like Mr. and Jr.

Third block: (?<=\.|\?) — a positive lookbehind that requires a dot (\.) or a question mark (\?) immediately before the split point.

Fourth block: (\s|[A-Z].*) — matched after the dot or question mark from the third block; it matches either a whitespace character (\s) or any sequence of characters starting with an uppercase letter ([A-Z].*).

This fourth block matters when the input looks like "Hello world.Hi I am here today.", i.e. when there may or may not be a space after the dot.
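As a quick check, here is a sketch that applies the pattern to the question's sample text (the filtering step is my addition, not part of the answer): because the fourth block is a capturing group, re.split also returns the matched separators, so whitespace-only chunks need to be dropped.

import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he "
        "paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't.")
parts = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
# The capturing group makes re.split return the separators too,
# so keep only the non-empty, non-whitespace chunks.
sentences = [p for p in parts if p and not p.isspace()]
for s in sentences:
    print(s)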
Answer 4 (score: 0)
Try this:
(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
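Note that Python's built-in re module rejects this pattern, because the lookbehind mixes alternatives of different widths and re only allows fixed-width lookbehinds. Here is a sketch using the third-party regex package, which does support variable-width lookbehinds (the sample text and the use of regex.split are my assumptions):

# Requires the third-party 'regex' package (pip install regex);
# the standard 're' module raises "look-behind requires fixed-width pattern".
import regex

pattern = r'(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)'
text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, "
        "i.e. he paid a lot for it. Did he mind?")
print(regex.split(pattern, text))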
Answer 5 (score: 0)
A naive approach, for proper English sentences that don't start with non-alphas and don't contain quoted parts of speech:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
    if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
        print(''.join(sentence))
        sentence = []
    if len(part):
        sentence.append(part)
if len(sentence):
    print(''.join(sentence))
False-positive splits can be reduced by extending NonEndings a bit; other cases would require additional code. Handling typos in a sensible way is difficult with this approach.
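For example, a slightly extended NonEndings might look like this (the extra abbreviations are illustrative, not exhaustive):

# Extended version of the NonEndings pattern above; the added entries are examples.
NonEndings = re.compile(r'(?:Mrs?|Ms|Dr|Prof|Sr|Jr|St|vs|etc|i\.e|e\.g)\.\s*$')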
This approach will never reach perfection. But depending on the task, it may work just "well enough"...
Answer 6 (score: 0)
I wrote this taking smci's comment above into consideration. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to supply a list of abbreviations, and it accounts for sentences ended by a terminator inside a wrapper, such as a period followed by a closing quote or bracket: [.", ?', .)].
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']
def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences

def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    # Drop terminators that are actually the final dot of an abbreviation.
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    # If the paragraph itself ends with a terminator, ignore that final one.
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    # Keep only endings followed by a space; the rightmost such ending wins.
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
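A small usage sketch (my addition, not part of the answer). Note that find_all matches abbreviations case-sensitively as written, so capitalized titles need their own dictionary entries:

# Usage sketch; the extra dictionary entries and the sample text are assumptions.
abbreviations.update({'Mr.': 'mister', 'Mrs.': 'mistress', 'Jr.': 'junior'})

paragraph = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, "
             "i.e. he paid a lot for it. Did he mind? "
             "Adam Jones Jr. thinks he didn't.")
for s in find_sentences(paragraph):
    print(s)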
Answer 7 (score: 0)
I'm not great with regular expressions, but a simpler, "brute force" version of the above is:
import re

sentence = re.compile(r"([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")
which means: the acceptable starting units are '[A-Z] or "[A-Z].
Please note that most regular expression constructs are greedy, so order matters a great deal when using | (or). That is why I wrote the i.e.-style alternative first, and only then forms like Inc.
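Here is a sketch of applying the compiled pattern above (my addition; it assumes the sentence pattern compiled above and the question's sample text). Since the pattern matches whole sentences rather than split points, finditer is the natural way to use it:

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he "
        "paid a lot for it. Did he mind?")
# Each match is one sentence according to the pattern above.
for match in sentence.finditer(text):
    print(match.group())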
Answer 8 (score: 0)
My example is based on Ali's example above, adapted for Brazilian Portuguese. Thanks, Ali.
import re

ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
'dra?s?', '[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
'[0-9]+', 'reg', 'f[ilí]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']
ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format('|\s'.join(ABREVIACOES)), re.IGNORECASE)
def sentencas(texto, min_len=5):
    # based on https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
    texto = re.sub(r'\s\s+', ' ', texto)
    EndPunctuation = re.compile(r'([\.\?\!]\s+)')
    parts = EndPunctuation.split(texto)
    sentencas = []
    sentence = []
    for part in parts:
        txt_sent = ''.join(sentence)
        q_len = len(txt_sent)
        if len(part) and len(sentence) and q_len >= min_len and \
                EndPunctuation.match(sentence[-1]) and \
                not ABREVIACOES_RGX.search(txt_sent):
            sentencas.append(txt_sent)
            sentence = []
        if len(part):
            sentence.append(part)
    if sentence:
        sentencas.append(''.join(sentence))
    return sentencas
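A usage sketch with an invented Portuguese sample (my addition; the sample text is an assumption):

# Usage sketch; the sample text is invented for illustration.
texto = ("O Dr. Silva comprou o site por 1,5 milhão de dólares. "
         "Ele se importou? O Sr. Jones acha que não.")
for s in sentencas(texto):
    print(s)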
Answer 9 (score: -1)
If you want to break sentences at 3 periods (not sure if this is what you want), you can use this regular expression:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)
for stuff in sentences:
    print(stuff)