I want to split a text into sub-sentences (clauses). How can I do this?
For example:
text = "Hi, this is an apple. Hi, that is pineapple."
The result should be:
['Hi,',
'this is an apple.',
'Hi,',
'that is pineapple.']
(P.S. I tried string.split(r'[,.]'), but it removes the delimiters.)
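Note that str.split treats its argument as a literal string rather than a regex; re.split does take a pattern, and a capturing group keeps the delimiters, though as separate list items rather than attached to each clause. A minimal sketch of that behavior:
import re
text = "Hi, this is an apple. Hi, that is pineapple."
# the capturing group makes re.split keep each delimiter as its own item
print(re.split(r'([,.])', text))
# ['Hi', ',', ' this is an apple', '.', ' Hi', ',', ' that is pineapple', '.', '']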
Answer 0 (score: 3)
The Natural Language Toolkit provides a tokenizer that can be used to split sentences. For example:
>>> import nltk
>>> nltk.download()  # in the downloader that opens, fetch the "punkt" package
>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> data = "Hi, this is an apple. Hi, that is pineapple."
>>> data = data.replace(',', '.')  # punkt splits only sentences, so promote commas to periods
>>> tokenizer.tokenize(data)
['Hi.', 'this is an apple.', 'Hi.', 'that is pineapple.']
The details of the tokenizer are documented here.
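If you prefer to skip the interactive downloader, nltk.download also accepts the package name directly, and nltk.tokenize.sent_tokenize wraps the same punkt models. Note that punkt only splits at sentence boundaries, which is why the answer promotes commas to periods first (and why the commas come back as periods in the output). A minimal sketch, assuming a recent NLTK:
>>> import nltk
>>> nltk.download('punkt')  # non-interactive download of the punkt models
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hi, this is an apple. Hi, that is pineapple.")
['Hi, this is an apple.', 'Hi, that is pineapple.']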
Answer 1 (score: 3)
Perhaps this would also work:
text.replace(', ', ',, ').replace('. ', '., ').split(', ')
Result:
['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
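To see why the trick works, here are the intermediate strings: doubling each delimiter leaves one copy attached to the clause after split consumes the ', ' separator. An illustrative trace:
text = "Hi, this is an apple. Hi, that is pineapple."
step1 = text.replace(', ', ',, ')   # 'Hi,, this is an apple. Hi,, that is pineapple.'
step2 = step1.replace('. ', '., ')  # 'Hi,, this is an apple., Hi,, that is pineapple.'
print(step2.split(', '))
# ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']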
Answer 2 (score: 2)
You can split on the whitespace \s+ using a zero-length look-behind assertion (?<=[,.]) for the punctuation. Because the look-behind consumes no characters, each comma or period stays attached to the preceding clause and only the whitespace is removed.
import re
text = "Hi, this is an apple. Hi, that is pineapple."
# split at any run of whitespace that immediately follows a comma or period
subsentence = re.compile(r'(?<=[,.])\s+')
print(subsentence.split(text))
Output:
['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
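The same pattern also works with re.split directly, without pre-compiling (a minor stylistic variant):
import re
text = "Hi, this is an apple. Hi, that is pineapple."
print(re.split(r'(?<=[,.])\s+', text))
# ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']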
Answer 3 (score: 0)
Here is another possible solution, using the approach from the github repository here:
import re
text = "Hi, this is an apple. Hi, that is pineapple."
# indices just past each comma or period, plus 0 for the start of the string
punct_locs = [0] + [i.start() + 1 for i in re.finditer(r'[,.]', text)]
# consecutive boundary pairs delimit the clauses; strip the leading whitespace
sentences = [text[start:end].strip() for start, end in zip(punct_locs[:-1], punct_locs[1:])]
print(sentences)
Which outputs:
['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
Answer 4 (score: 0)
Why make it so complicated by importing heavy modules? Just use a simple, clean approach without importing any modules:
text = "Hi, this is an apple. Hi, that is pineapple."
for i in text.split('.'):
    if i:
        print(i.strip().split(','))
Output:
['Hi', ' this is an apple']
['Hi', ' that is pineapple']
You can do it in one line:
text = "Hi, this is an apple. Hi, that is pineapple."
print([i.strip().split(',') for i in text.split('.') if i])
Output:
[['Hi', ' this is an apple'], ['Hi', ' that is pineapple']]
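Note that this gives nested lists and drops the punctuation, unlike the flat list asked for in the question. If you need that exact format, a sketch along these lines (not part of the original answer) re-attaches the delimiters:
text = "Hi, this is an apple. Hi, that is pineapple."
result = []
for sentence in text.split('.'):
    if sentence.strip():
        parts = sentence.strip().split(',')
        # re-attach the comma that split(',') stripped from all but the last part
        result.extend(p.strip() + ',' for p in parts[:-1])
        result.append(parts[-1].strip() + '.')
print(result)
# ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']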