Splitting text into sentences in Python

Date: 2011-01-01 22:19:00

Tags: python text split

I have a text file. I need a list of sentences.

How can this be implemented? There are many subtleties, such as periods being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

16 Answers:

Answer 0 (score: 125):

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

# load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)
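
(One note from me, not in the original answer: if the Punkt models aren't already on disk, nltk.data.load may fail, so you may first need to download them.)

import nltk
nltk.download('punkt')  # fetches the Punkt sentence tokenizer models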

Answer 1 (score: 76):

This function can split the whole text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re
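# The idea (as I read the code): rewrite periods that are NOT sentence boundaries
# as <prd>, mark real boundaries with <stop>, then split on <stop> and turn
# <prd> back into '.'.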
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
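
For illustration, a quick usage sketch with the edge-case text quoted above (per the answer's claim, this should come back as two sentences):

sample = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel "
          "before joining Nike Inc. as an engineer. He also worked at craigslist.org "
          "as a business analyst.")
for sentence in split_into_sentences(sample):
    print(sentence)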

Answer 2 (score: 20):

Instead of using a regular expression to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

Reference: https://stackoverflow.com/a/9474645/2877052

Answer 3 (score: 7):

Here's a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example '.' vs. '."'.

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

Answer 4 (score: 5):

You can try using spaCy instead of regular expressions. I use it and it does the job.

import spacy
nlp = spacy.load('en')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.string.strip())
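
A side note from me (not part of the original answer): in newer spaCy releases (3.x) the 'en' shortcut no longer works and sent.text is the attribute to use, so the equivalent would look roughly like this, assuming the en_core_web_sm model has been installed:

import spacy

nlp = spacy.load('en_core_web_sm')  # e.g. after: python -m spacy download en_core_web_sm

doc = nlp('''Your text here''')
for sent in doc.sents:
    print(sent.text.strip())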

Answer 5 (score: 4):

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is ' *\. +', which matches a period surrounded by zero or more spaces to the left and one or more to the right (to keep something like the period in re.split from being counted as a sentence break).

Obviously, this isn't the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences afterwards and check that each string in sentences starts with a capital letter? See the sketch below).
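
A rough sketch of that post-processing idea (the helper name is mine; it assumes a fragment starting with a lowercase letter belongs to the previous sentence):

def merge_lowercase_fragments(sentences):
    # merge fragments that don't start with a capital letter back into the
    # previous sentence (crude abbreviation handling; re-joins with ". " since
    # the re.split above dropped the terminator)
    merged = []
    for s in sentences:
        if merged and s and not s[0].isupper():
            merged[-1] = merged[-1] + ". " + s
        else:
            merged.append(s)
    return merged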

Answer 6 (score: 1):

@Artyom,

Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

Then call it like this:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
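
Tracing the replacements by hand (my own note, not from the original answer), the call above should yield word-level tokens along these lines:

print(tokens)
# ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']
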
Good luck, Marilena.

Answer 7 (score: 1):

No doubt that NLTK is the most suitable for the purpose. But getting started with NLTK is quite painful (but once you install it, you just reap the rewards).

So here is the simple re-based code available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html:

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters

    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence

#   And do you think this is a question 

Answer 8 (score: 1):

Also, be wary of additional top-level domains that aren't covered in some of the answers above.

For example, .info, .biz, .ru, and .online will throw off some sentence parsers but aren't included above.

Here's some info on the frequency of top-level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

This can be addressed by editing the code above to read:

alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|ai|edu|co.uk|ru|info|biz|online)"

Answer 9 (score: 0):

I had to read subtitle files and split them into sentences. After pre-processing (such as removing the time information from the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below split them neatly into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.

import re

# Very approximate way to split the text into sentences - break after ? . and !
fullFile = re.sub("(\!|\?|\.) ", "\\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()

Oh well! I realize now that since my content was Spanish, I didn't run into the issues of dealing with "Mr. Smith" and the like. Still, if someone wants a quick and dirty parser...

Answer 10 (score: 0):

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

Answer 11 (score: 0):

I hope this helps you split Latin, Chinese, and Arabic text:

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

Answer 12 (score: 0):

I was working on a similar task and came across this question. After following a few links and working through a few nltk exercises, the code below worked for me like magic.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text) 

Output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

Answer 13 (score: 0):

import spacy

nlp = spacy.load('en_core_web_sm')

text = "How are you today? I hope you have a great day"

tokens = nlp(text)

for sent in tokens.sents:
    print(sent.string.strip())

Answer 14 (score: 0):

Might as well throw this in, since this is the first post that shows up when searching for splitting text into chunks of n sentences.

This works with a variable split length, which indicates how many sentences end up joined together in each chunk.

import nltk
# nltk.download('punkt')
from more_itertools import windowed

split_length = 3  # 3 sentences per chunk, for example

elements = nltk.tokenize.sent_tokenize(text)
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    # windowed() pads the final window with None, hence the filter below
    txt = " ".join([t for t in seg if t])
    if len(txt) > 0:
        text_splits.append(txt)
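
A quick usage sketch (my own example text, with the chunks I'd expect):

text = "I came. I saw. I conquered. Then I left. It was late. I slept. The end."
# with split_length = 3, text_splits should come out roughly as:
# ['I came. I saw. I conquered.', 'Then I left. It was late. I slept.', 'The end.']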

Answer 15 (score: 0):

In case NLTK's sent_tokenize is not an option (e.g. it needs a lot of GPU RAM on long text) and regex doesn't work properly across languages, sentence splitter might be worth a try.
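
If this refers to the sentence-splitter package on PyPI (my assumption; the answer doesn't say), usage looks roughly like this:

# pip install sentence-splitter
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='en')
print(splitter.split(text='This is a paragraph. It contains several sentences. "But why," you ask?'))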