Splitting a paragraph into sentences - troubleshooting needed

Date: 2014-01-02 21:29:33

Tags: python list

I have the following simple paragraph data that I am trying to parse into sentences:

(['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])

My expected result set should be:

['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', '"He was smiling.']
['2893357', 'STELLAR', '"He welcomed me to See and asked how he was able to assist me that day.']

My code is as follows:

sentences = list(data_set)
for i,y in enumerate(sentences):
    pig = sentences[i]
    pig = [[pig[0], pig[1], y] for y in pig[2].split('. ')]
    sentences[i:i+1] = pig

Thanks.

3 answers:

Answer 0 (score: 1):

You can use a list comprehension together with re.split:

>>> from re import split
>>> data = (['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])
>>> new_list = [[w,x,z] for w,x,y in data for z in split(r"(?<=\.) ", y)]
>>> for item in new_list:
...     print(item)
...
['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', 'He was smiling.']
['2893357', 'STELLAR', 'He welcomed me to Sewell and asked how he was able to assist me that day.']
>>>
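If the text can also end sentences with ! or ?, the same lookbehind idea generalizes to a character class; here is a minimal variant of the pattern above (my addition, not part of the original answer):

>>> new_list = [[w,x,z] for w,x,y in data for z in split(r"(?<=[.!?])\s+", y)]

The raw string avoids escape-sequence warnings in newer Python versions, and \s+ also absorbs any double spaces between sentences.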

Note, however, that the output differs slightly from the sample output you provided. I think that is because you made a few typos when writing the question. For example, in the last sentence you say you want See, but See never appears in the sample data; it should be Sewell.

Answer 1 (score: 1):

If you really want to split text into sentences, you should not use split, because it fails on exclamation marks, abbreviations, and so on; in general there are many edge cases you really do not want to handle yourself.
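For instance, a naive split on '. ' tears abbreviations apart and misses other terminators entirely; an illustrative session (my example, not from the original answer):

>>> 'I met Mr. Smith. He waved! Was that ok?'.split('. ')
['I met Mr', 'Smith', 'He waved! Was that ok?']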

Fortunately, nltk ships a utility called punkt that splits paragraphs into sentences. To use punkt, do the following:

>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print(sent_detector.tokenize(text.strip()))
['Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.',
'And sometimes sentences can start with non-capitalized words.', 
'i is a good variable name.']

This example (and many others) is borrowed from nltk's documentation.

Applied to your specific problem:

import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
my_data = ['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.']
tokens = sent_detector.tokenize(my_data[2])
print([[my_data[0], my_data[1], sentence] for sentence in tokens])

[['2893357', 'SUPER', 'sesame street.'],
['2893357', 'SUPER', 'The books are all open.'],
['2893357', 'SUPER', 'I saw no trash or debris.'],
['2893357', 'SUPER', 'She was clean and well organized.']]
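To reproduce the full expected result set, the same detector can be run over every record in the tuple; a minimal sketch (assuming the punkt model has already been downloaded, e.g. via nltk.download('punkt'), and using illustrative variable names):

import nltk.data

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# data_set is the tuple of [id, label, paragraph] records from the question
result = [[record_id, label, sentence]
          for record_id, label, paragraph in data_set
          for sentence in sent_detector.tokenize(paragraph)]

for row in result:
    print(row)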

Answer 2 (score: 0):

Here is one approach:

>>> data_set = (['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])
>>> for i, j, k in data_set:
...     for sentence in k.split("."):
...             sentence = sentence.strip()
...             if not sentence:
...                     continue
...             print([i, j, sentence + "."])
... 
['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', 'He was smiling.']
['2893357', 'STELLAR', 'He welcomed me to Sewell and asked how he was able to assist me that day.']
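If you need the rows as a list instead of printed output, the same logic fits into a single comprehension (a sketch along the same lines, not part of the original answer):

rows = [[i, j, sentence.strip() + "."]
        for i, j, k in data_set
        for sentence in k.split(".")
        if sentence.strip()]

Like the loop above, this drops the empty trailing chunk and strips the leading space before re-attaching the period.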