How can I split text into clauses with Python?

Asked: 2017-12-24 22:17:54

Tags: python nlp nltk

I want to split a text into clauses. How can I do that?

For example:

text = "Hi, this is an apple. Hi, that is pineapple."

The result should be:

['Hi,',
 'this is an apple.',
 'Hi,',
 'that is pineapple.']

(P.S. I tried string.split(r'[,.]'), but it removes the delimiters.)
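Note that str.split treats its argument as a literal separator, not a regex; re.split does take a regex, and when the pattern contains a capturing group it keeps the delimiters in the result. A quick sketch:

```python
import re

text = "Hi, this is an apple. Hi, that is pineapple."
# a capturing group makes re.split keep the punctuation it splits on
parts = re.split(r'([,.])\s*', text)
# re-attach each delimiter to the clause before it
clauses = [a + b for a, b in zip(parts[::2], parts[1::2])]
print(clauses)  # → ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```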

5 answers:

Answer 0 (score: 3)

Related question

The Natural Language Toolkit provides a tokenizer that can be used to split sentences. For example:

>>> import nltk
>>> nltk.download()   # enter "punkt"

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> data = "Hi, this is an apple. Hi, that is pineapple."
>>> data = data.replace(',', '.')
>>> tokenizer.tokenize(data)
['Hi.', 'this is an apple.', 'Hi.', 'that is pineapple.']

Details of the tokenizer are documented here.

Answer 1 (score: 3)

Maybe this can also work:

text.replace(', ', ',, ').replace('. ', '., ').split(', ')

Result:

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
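To see why this works, here is the intermediate string after the two replace calls (a quick sketch):

```python
text = "Hi, this is an apple. Hi, that is pineapple."
# each delimiter gets a ', ' marker appended after it...
step = text.replace(', ', ',, ').replace('. ', '., ')
print(step)  # → Hi,, this is an apple., Hi,, that is pineapple.
# ...so splitting on ', ' consumes the marker and keeps the original punctuation
clauses = step.split(', ')
print(clauses)  # → ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```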

Answer 2 (score: 2)

You can split on the whitespace \s+ that follows each punctuation mark by using a zero-length look-behind assertion (?<=[,.]).

import re

text = "Hi, this is an apple. Hi, that is pineapple."
subsentence = re.compile(r'(?<=[,.])\s+')

print(subsentence.split(text))

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
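If you need more clause boundaries than commas and periods, extending the character class is one option (a sketch; adjust the punctuation set to your data):

```python
import re

text = "Wait; really? Yes! Hi, that is pineapple."
# look-behind covers semicolons, colons, and sentence-final punctuation too
clauses = re.split(r'(?<=[,.;:!?])\s+', text)
print(clauses)  # → ['Wait;', 'really?', 'Yes!', 'Hi,', 'that is pineapple.']
```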

Answer 3 (score: 0)

Here is another possible solution, using the github repository here:

import re

text = "Hi, this is an apple. Hi, that is pineapple."

punct_locs = [0] + [i.start() + 1 for i in re.finditer(r'[,.]', text)]

sentences = [text[start:end].strip() for start, end in zip(punct_locs[:-1], punct_locs[1:])]

print(sentences)

Which outputs:

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
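One caveat with this approach: if the text does not end with punctuation, the final clause is silently dropped, because zip stops at the last recorded position. Appending len(text) as a sentinel end position is one way to fix that (a sketch):

```python
import re

text = "Hi, this is an apple. Hi, that is pineapple"  # no trailing period
punct_locs = [0] + [m.start() + 1 for m in re.finditer(r'[,.]', text)] + [len(text)]
# the filter drops the empty slice that appears when the text *does* end with punctuation
clauses = [text[s:e].strip() for s, e in zip(punct_locs[:-1], punct_locs[1:])
           if text[s:e].strip()]
print(clauses)  # → ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple']
```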

Answer 4 (score: 0)

Why make it so complicated by importing heavy modules? Just use a simple, clean approach without importing any module:

text = "Hi, this is an apple. Hi, that is pineapple."
for i in text.split('.'):
    if i:
        print(i.strip().split(','))

Output:

['Hi', ' this is an apple']
['Hi', ' that is pineapple']

You can do it in one line:

text = "Hi, this is an apple. Hi, that is pineapple."
print([i.strip().split(',') for i in text.split('.') if i])

Output:

[['Hi', ' this is an apple'], ['Hi', ' that is pineapple']]
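Note that this drops the delimiters, unlike the output the question asks for. To at least flatten the nested lists into a single list (still without the punctuation), one could write:

```python
text = "Hi, this is an apple. Hi, that is pineapple."
# split on periods, then on commas, flattening and stripping in one comprehension
flat = [part.strip()
        for chunk in text.split('.') if chunk.strip()
        for part in chunk.split(',')]
print(flat)  # → ['Hi', 'this is an apple', 'Hi', 'that is pineapple']
```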