How can I split text into clauses with Python?

Asked: 2017-12-24 22:17:54

Tags: python nlp nltk

I want to split a text into clauses. How can I do that?

For example:

text = "Hi, this is an apple. Hi, that is pineapple."

The result should be:

['Hi,',
 'this is an apple.',
 'Hi,',
 'that is pineapple.']

(P.S. I tried string.split(r'[,.]'), but it removes the delimiters.)
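Note that str.split treats its argument as a literal separator, not a regex; re.split does take a regex, and when the pattern contains a capturing group it keeps the delimiters in the result. A quick sketch:

```python
import re

text = "Hi, this is an apple. Hi, that is pineapple."
# a capturing group makes re.split keep the punctuation it splits on
parts = re.split(r'([,.])\s*', text)
# re-attach each delimiter to the clause before it
clauses = [a + b for a, b in zip(parts[::2], parts[1::2])]
print(clauses)  # → ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```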

5 answers:

Answer 0 (score: 3)

Related question

The Natural Language Toolkit provides a tokenizer that can be used to split sentences. For example:

>>> import nltk
>>> nltk.download()   # enter "punkt"

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> data = "Hi, this is an apple. Hi, that is pineapple."
>>> data = data.replace(',', '.')
>>> tokenizer.tokenize(data)
['Hi.', 'this is an apple.', 'Hi.', 'that is pineapple.']

Details of the tokenizer are documented here.

Answer 1 (score: 3)

Maybe this can also work:

text.replace(', ', ',, ').replace('. ', '., ').split(', ')

Result:

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
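To see why this works, here is the intermediate string after the two replace calls (a quick sketch):

```python
text = "Hi, this is an apple. Hi, that is pineapple."
# each delimiter gets a ', ' marker appended after it...
step = text.replace(', ', ',, ').replace('. ', '., ')
print(step)  # → Hi,, this is an apple., Hi,, that is pineapple.
# ...so splitting on ', ' consumes the marker and keeps the original punctuation
clauses = step.split(', ')
print(clauses)  # → ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```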

Answer 2 (score: 2)

You can split on the whitespace \s+ that follows each punctuation mark by using a zero-length look-behind assertion (?<=[,.]).

import re

text = "Hi, this is an apple. Hi, that is pineapple."
subsentence = re.compile(r'(?<=[,.])\s+')

print(subsentence.split(text))

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
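If you need more clause boundaries than commas and periods, extending the character class is one option (a sketch; adjust the punctuation set to your data):

```python
import re

text = "Wait; really? Yes! Hi, that is pineapple."
# look-behind covers semicolons, colons, and sentence-final punctuation too
clauses = re.split(r'(?<=[,.;:!?])\s+', text)
print(clauses)  # → ['Wait;', 'really?', 'Yes!', 'Hi,', 'that is pineapple.']
```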

Answer 3 (score: 0)

Here is another possible solution, using the github repository here:

import re

text = "Hi, this is an apple. Hi, that is pineapple."

punct_locs = [0] + [i.start() + 1 for i in re.finditer(r'[,.]', text)]

sentences = [text[start:end].strip() for start, end in zip(punct_locs[:-1], punct_locs[1:])]

print(sentences)

Which outputs:

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
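One caveat with this approach: if the text does not end with punctuation, the final clause is silently dropped, because zip stops at the last recorded position. Appending len(text) as a sentinel end position is one way to fix that (a sketch):

```python
import re

text = "Hi, this is an apple. Hi, that is pineapple"  # no trailing period
punct_locs = [0] + [m.start() + 1 for m in re.finditer(r'[,.]', text)] + [len(text)]
# the filter drops the empty slice that appears when the text *does* end with punctuation
clauses = [text[s:e].strip() for s, e in zip(punct_locs[:-1], punct_locs[1:])
           if text[s:e].strip()]
print(clauses)  # → ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple']
```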

Answer 4 (score: 0)

Why make it so complicated by importing heavy modules? Just use a simple, clean approach without importing any module:

text = "Hi, this is an apple. Hi, that is pineapple."
for i in text.split('.'):
    if i:
        print(i.strip().split(','))

Output:

['Hi', ' this is an apple']
['Hi', ' that is pineapple']

You can do it in one line:

text = "Hi, this is an apple. Hi, that is pineapple."
print([i.strip().split(',') for i in text.split('.') if i])

Output:

[['Hi', ' this is an apple'], ['Hi', ' that is pineapple']]
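Note that this drops the delimiters, unlike the output the question asks for. To at least flatten the nested lists into a single list (still without the punctuation), one could write:

```python
text = "Hi, this is an apple. Hi, that is pineapple."
# split on periods, then on commas, flattening and stripping in one comprehension
flat = [part.strip()
        for chunk in text.split('.') if chunk.strip()
        for part in chunk.split(',')]
print(flat)  # → ['Hi', 'this is an apple', 'Hi', 'that is pineapple']
```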