How to tokenize this text into sentences using regex

Time: 2017-05-26 19:51:00

Tags: python regex tokenize

  

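"You could not possibly have come at a better time, my dear Watson," he said cordially. 'It is not worth your while to wait,' she went on. "You can pass through the door; no one hinders." And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.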

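Have a look at the highlighted area. How can I distinguish the case where a '"' is followed by a period (.) that ends a sentence from the case where a period (.) is followed by a '"'?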

I have tried the tokenizer below. It works well except for that part.

(([^।\.?!]|[।\.?!](?=[\"\']))+\s*[।\.?!]\s*)
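To see where the pattern goes wrong, it can be run against the passage with `re.finditer`. This is only a sketch: the `text` variable uses the English passage with a space after each sentence, which is an assumption about the exact input, and the variable names are illustrative.

```python
import re

# The pattern from the question, unchanged, written as a raw string.
pattern = re.compile(r"(([^।\.?!]|[।\.?!](?=[\"\']))+\s*[।\.?!]\s*)")

# Assumed input: the passage with one space between sentences.
text = (
    '"You could not possibly have come at a better time, my dear Watson," '
    "he said cordially. 'It is not worth your while to wait,' she went on. "
    '"You can pass through the door; no one hinders." And then, seeing that '
    "I smiled and shook my head, she suddenly threw aside her constraint and "
    "made a step forward, with her hands wrung together."
)

# Collect each full match, trimming the trailing whitespace the pattern consumes.
matches = [m.group(0).strip() for m in pattern.finditer(text)]
for sentence in matches:
    print(repr(sentence))
```

With this input the last two sentences come back as a single match: the lookahead lets the pattern consume any period that is followed by a quotation mark, so the period in `hinders."` is never treated as a sentence end.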

Edit: I do not intend to use any NLP toolkit to solve this problem.

2 Answers:

Answer 0 (score: 1)

Use NLTK here instead of a regular expression:

from nltk import sent_tokenize

# your_string is a placeholder for the text to split. sent_tokenize also needs
# NLTK's Punkt tokenizer data to be installed (for example via nltk.download).
parts = sent_tokenize(your_string)
# ['"You could not possibly have come at a better time, my dear Watson," he said cordially.', "'It is not worth your while to wait,' she went on.", '"You can pass through the door; no one hinders."', 'And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.']

Answer 1 (score: 0)

I found this function a while ago:

import re

def split_into_sentences(text):
    # Pattern fragments reused by the substitutions below.
    caps = r"([A-Z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov|mobi|info|edu)"

    # The original targeted Python 2 (`unicode`); in Python 3 only bytes need decoding.
    if isinstance(text, bytes):
        text = text.decode("utf-8")

    # Pad with spaces so patterns that expect a leading space also match at the start.
    text = " {0} ".format(text)
    text = text.replace("\n", " ")

    # Protect periods that do not end a sentence by rewriting them as <prd>.
    text = re.sub(prefixes, r"\1<prd>", text)
    text = re.sub(websites, r"<prd>\1", text)
    text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + "[.] ", r" \1<prd> ", text)
    text = re.sub(acronyms + " " + starters, r"\1<stop> \2", text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]", r"\1<prd>\2<prd>\3<prd>", text)
    text = re.sub(caps + "[.]" + caps + "[.]", r"\1<prd>\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, r" \1<stop> \2", text)
    text = re.sub(" " + suffixes + "[.]", r" \1<prd>", text)
    text = re.sub(" " + caps + "[.]", r" \1<prd>", text)

    # Move terminal punctuation outside a closing double quote so the quote
    # stays with its sentence (note that this alters the returned text).
    text = text.replace('."', '".')
    text = text.replace('!"', '"!')
    text = text.replace('?"', '"?')

    # Mark every remaining terminator, restore protected periods, then split.
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]  # drop the trailing fragment after the last marker
    sentences = [s.strip() for s in sentences]
    return sentences
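
The core idea of the function above is a placeholder technique: hide the periods that do not end a sentence, mark the ones that do, then split on the marker. A stripped-down sketch of that idea might look like the following. The name `split_with_markers` and the tiny title list are illustrative assumptions, not part of the original function, and unlike the original this version keeps the terminator inside a closing quote instead of moving it.

```python
import re

def split_with_markers(text):
    # Protect periods that belong to titles such as "Dr." (tiny illustrative list).
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr)[.]", r"\1<prd>", text)
    # Mark a sentence end after ., ? or !, including an optional closing quote.
    text = re.sub(r"([.?!][\"']?)", r"\1<stop>", text)
    # Restore the protected periods, then split on the markers.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]

sample = 'Dr. Watson came in. "No one hinders." And then she left.'
for sentence in split_with_markers(sample):
    print(sentence)
```

This sketch treats any quote directly after a terminator as a closing quote, so input such as `on."You` (no space before an opening quote) would still be split in the wrong place. Resolving that ambiguity needs extra context, for example checking whether whitespace follows the quote.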