I have the following simple paragraphs that I'm trying to parse into sentences:
(['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])
My expected result set should be:
['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', '"He was smiling.']
['2893357', 'STELLAR', '"He welcomed me to See and asked how he was able to assist me that day.']
My code is as follows:
sentences = list(data_set)
for i, y in enumerate(sentences):
    pig = sentences[i]
    pig = [[pig[0], pig[1], y] for y in pig[2].split('. ')]
    sentences[i:i+1] = pig
Thanks.
Answer 0 (score: 1)
You can use a list comprehension together with re.split:
>>> from re import split
>>> data = (['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])
>>> new_list = [[w, x, z] for w, x, y in data for z in split(r"(?<=\.) ", y)]
>>> for item in new_list:
... print(item)
...
['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', 'He was smiling.']
['2893357', 'STELLAR', 'He welcomed me to Sewell and asked how he was able to assist me that day.']
>>>
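In case the pattern is unclear: the lookbehind (?<=\.) only asserts that a period precedes the space, so the split consumes just the space and each sentence keeps its trailing period. Compare with a plain split on '. ':

>>> import re
>>> re.split(r"\. ", "One. Two. Three.")
['One', 'Two', 'Three.']
>>> re.split(r"(?<=\.) ", "One. Two. Three.")
['One.', 'Two.', 'Three.']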
Note, however, that this output differs slightly from the sample output you provided. I believe that's because of a few typos in your post. For example, in the last sentence you say you want See, but See never appears in the sample data; it should be Sewell instead.
Answer 1 (score: 1)
If you really want to split text into sentences, you shouldn't use split, since it fails on exclamation marks, abbreviations, and the like, and in general there are a lot of edge cases you really don't want to handle yourself.
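For instance, a plain split on '. ' treats abbreviations as sentence boundaries and never splits on '!' at all (a made-up example):

>>> "I met Dr. Smith! He said hi. See you at 5 p.m. tomorrow.".split(". ")
['I met Dr', 'Smith! He said hi', 'See you at 5 p.m', 'tomorrow.']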
Fortunately, nltk ships with a utility called punkt that splits paragraphs into sentences. To use punkt, do the following:
>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print(sent_detector.tokenize(text.strip()))
['Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.',
'And sometimes sentences can start with non-capitalized words.',
'i is a good variable name.']
Example borrowed (along with many others) from nltk's documentation. Lightly adapted to your specific problem:
import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
my_data = ['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.']
tokens = sent_detector.tokenize(my_data[2])
print([[my_data[0], my_data[1], sentence] for sentence in tokens])
>>> [['2893357', 'SUPER', 'sesame street.'],
['2893357', 'SUPER', 'The books are all open.'],
['2893357', 'SUPER', 'I saw no trash or debris.'],
['2893357', 'SUPER', 'She was clean and well organized.']]
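Two practical notes: depending on how NLTK was installed, the Punkt model may need a one-time nltk.download('punkt') before nltk.data.load can find it, and the same tokenize-and-rebuild step extends to the full data_set from your question. A sketch under those assumptions:

import nltk
import nltk.data

nltk.download('punkt')  # one-time fetch of the pretrained Punkt models

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# Tokenize every [id, tag, paragraph] row and rebuild one row per sentence.
new_list = [[row[0], row[1], sentence]
            for row in data_set
            for sentence in sent_detector.tokenize(row[2])]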
Answer 2 (score: 0)
Here's one way to do it:
>>> data_set = (['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])
>>> for i, j, k in data_set:
...     for sentence in k.split("."):
...         sentence = sentence.strip()
...         if not sentence:
...             continue
...         print([i, j, sentence + "."])
...
['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', 'He was smiling.']
['2893357', 'STELLAR', 'He welcomed me to Sewell and asked how he was able to assist me that day.']
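If you need this expansion in more than one place, the same loop can be wrapped in a small helper. This is just a sketch; the names split_rows and new_rows are mine, not from the question:

def split_rows(data_set, sep="."):
    # Expand each [id, tag, paragraph] row into one row per sentence,
    # re-attaching the separator that split() removed.
    rows = []
    for row_id, tag, paragraph in data_set:
        for sentence in paragraph.split(sep):
            sentence = sentence.strip()
            if sentence:  # skip the empty piece after the final "."
                rows.append([row_id, tag, sentence + sep])
    return rows

new_rows = split_rows(data_set)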