I am trying to tokenize some text:
from nltk.tokenize import sent_tokenize, word_tokenize
text = '''The team used archive "data" from 2016...and 2017
captured by the ESA/NASA Hubble Space Telescope and developed
open-source algorithms to analyse the starlight filtered through
K2-18b’s atmosphere. The results revealed the molecular
signature of water vapour, also indicating the presence of
hydrogen and helium in the planet’s atmosphere.'''
token = sent_tokenize(text)
token
This gives me:
['The team used archive "data" from 2016...and 2017 captured by the ESA/NASA Hubble Space Telescope and developed open-source algorithms to analyse the starlight filtered through K2-18b’s atmosphere.',
'The results revealed the molecular signature of water vapour, also indicating the presence of hydrogen and helium in the planet’s atmosphere.']
How do I convert this into a string, but keep the '' around each sentence?
Everything I have found joins the elements of the list and removes the tokenization.
Edit: essentially I want the output below. After parsing, will Python treat the .\n
as a new line? (Note, I got this from a Python readability page):
text = ('This is sentence one .\n'
'This is sentence two \n.')
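On the `.\n` question: Python concatenates adjacent string literals at parse time, and `\n` becomes a real newline character when the string is printed. A quick check (assuming the two-sentence example above):

```python
# Adjacent string literals are joined into one string by the parser
text = ('This is sentence one.\n'
        'This is sentence two.\n')

# The \n escape is a single newline character, so printing shows two lines
print(text)
# splitlines() confirms the string contains two lines
print(text.splitlines())
```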
Thanks
Answer 0 (score: 1)
Based on the information currently in the OP, you can try the following:
a = ['sentence 1', 'sentence 2', 'let me guess... a third sentence?']
s = str(a).replace('[', '').replace(']', '').replace(', ', '\n').replace(',', '\n')
print(s)
This will output:
$ python p.py
'sentence 1'
'sentence 2'
'let me guess... a third sentence?'
Note the use of both replace(', ', '\n') and replace(',', '\n').
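As a possible alternative: the chained replace calls can break if a sentence itself contains a comma or bracket, since they operate on str(a) rather than on the list items. A sketch that wraps each element in quotes and joins with newlines directly, sidestepping that issue:

```python
a = ['sentence 1', 'sentence 2', 'let me guess... a third sentence?']

# Quote each sentence individually, then join with newlines;
# commas or brackets inside a sentence are left untouched
s = '\n'.join("'{}'".format(sentence) for sentence in a)
print(s)
```

This prints the same three quoted lines as the replace-based version, one sentence per line.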