I have the following simple paragraphs that I'm trying to parse into sentences:
(['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])
My expected result set should be:
['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', '"He was smiling.']
['2893357', 'STELLAR', '"He welcomed me to See and asked how he was able to assist me that day.']
My code is as follows:
sentences = list(data_set)
for i, y in enumerate(sentences):
    pig = sentences[i]
    pig = [[pig[0], pig[1], y] for y in pig[2].split('. ')]
    sentences[i:i+1] = pig
Thanks.
Answer 0 (score: 1)
You can use a list comprehension together with re.split:
>>> from re import split
>>> data = (['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])
>>> new_list = [[w, x, z] for w, x, y in data for z in split(r"(?<=\.) ", y)]
>>> for item in new_list:
... print(item)
...
['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', 'He was smiling.']
['2893357', 'STELLAR', 'He welcomed me to Sewell and asked how he was able to assist me that day.']
>>>
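In case the pattern is unclear: the lookbehind (?<=\.) only asserts that a period precedes the space, so the split consumes just the space and each sentence keeps its trailing period. Compare with a plain split on '. ':

>>> import re
>>> re.split(r"\. ", "One. Two. Three.")
['One', 'Two', 'Three.']
>>> re.split(r"(?<=\.) ", "One. Two. Three.")
['One.', 'Two.', 'Three.']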
Note, however, that this output differs slightly from the sample output you provided. I believe that's because of a few typos in your post. For example, in the last sentence you say you want See, but See never appears in the sample data; it should be Sewell instead.
Answer 1 (score: 1)
If you really want to split text into sentences, you shouldn't use split, since it fails on exclamation marks, abbreviations, and the like, and in general there are a lot of edge cases you really don't want to handle yourself.
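For instance, a plain split on '. ' treats abbreviations as sentence boundaries and never splits on '!' at all (a made-up example):

>>> "I met Dr. Smith! He said hi. See you at 5 p.m. tomorrow.".split(". ")
['I met Dr', 'Smith! He said hi', 'See you at 5 p.m', 'tomorrow.']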
Fortunately, nltk ships with a utility called punkt that splits paragraphs into sentences. To use punkt, do the following:
>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print(sent_detector.tokenize(text.strip()))
['Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.',
'And sometimes sentences can start with non-capitalized words.',
'i is a good variable name.']
Example borrowed (along with many others) from nltk's documentation. Lightly adapted to your specific problem:
import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
my_data = ['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.']
tokens = sent_detector.tokenize(my_data[2])
print([[my_data[0], my_data[1], sentence] for sentence in tokens])
>>> [['2893357', 'SUPER', 'sesame street.'],
['2893357', 'SUPER', 'The books are all open.'],
['2893357', 'SUPER', 'I saw no trash or debris.'],
['2893357', 'SUPER', 'She was clean and well organized.']]
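Two practical notes: depending on how NLTK was installed, the Punkt model may need a one-time nltk.download('punkt') before nltk.data.load can find it, and the same tokenize-and-rebuild step extends to the full data_set from your question. A sketch under those assumptions:

import nltk
import nltk.data

nltk.download('punkt')  # one-time fetch of the pretrained Punkt models

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# Tokenize every [id, tag, paragraph] row and rebuild one row per sentence.
new_list = [[row[0], row[1], sentence]
            for row in data_set
            for sentence in sent_detector.tokenize(row[2])]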
Answer 2 (score: 0)
Here's one way to do it:
>>> data_set = (['2893357', 'SUPER', 'sesame street. The books are all open. I saw no trash or debris. She was clean and well organized.'], ['2893357', 'STELLAR', '"I stopped and turned it off. He was smiling. He welcomed me to Sewell and asked how he was able to assist me that day.'])
>>> for i, j, k in data_set:
...     for sentence in k.split("."):
...         sentence = sentence.strip()
...         if not sentence:
...             continue
...         print([i, j, sentence + "."])
...
['2893357', 'SUPER', 'sesame street.']
['2893357', 'SUPER', 'The books are all open.']
['2893357', 'SUPER', 'I saw no trash or debris.']
['2893357', 'SUPER', 'She was clean and well organized.']
['2893357', 'STELLAR', '"I stopped and turned it off.']
['2893357', 'STELLAR', 'He was smiling.']
['2893357', 'STELLAR', 'He welcomed me to Sewell and asked how he was able to assist me that day.']
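If you need this expansion in more than one place, the same loop can be wrapped in a small helper. This is just a sketch; the names split_rows and new_rows are mine, not from the question:

def split_rows(data_set, sep="."):
    # Expand each [id, tag, paragraph] row into one row per sentence,
    # re-attaching the separator that split() removed.
    rows = []
    for row_id, tag, paragraph in data_set:
        for sentence in paragraph.split(sep):
            sentence = sentence.strip()
            if sentence:  # skip the empty piece after the final "."
                rows.append([row_id, tag, sentence + sep])
    return rows

new_rows = split_rows(data_set)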