Python tokenized text: how do I convert a tokenized list to a string?

Asked: 2019-11-25 22:41:56

Tags: python nltk tokenize

I am trying to tokenize some text:

from nltk.tokenize import sent_tokenize, word_tokenize 

text = '''The team used archive "data" from 2016...and 2017 
captured by the ESA/NASA Hubble Space Telescope and developed 
open-source algorithms to analyse the starlight filtered through 
K2-18b’s atmosphere. The results revealed the molecular 
signature of water vapour, also indicating the presence of 
hydrogen and helium in the planet’s atmosphere.'''

token = (sent_tokenize(text))
token

This gives me:

['The team used archive "data" from 2016...and 2017 captured by the ESA/NASA Hubble Space Telescope and developed open-source algorithms to analyse the starlight filtered through K2-18b’s atmosphere.',
 'The results revealed the molecular signature of water vapour, also indicating the presence of hydrogen and helium in the planet’s atmosphere.']

How can I convert this to a string while keeping the '' quotes around each sentence?

Everything I have found joins the elements of the list together and removes the tokenization.

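Edit: I essentially want the output to look like the following. When parsed, will Python treat .\n as a new line? (Note that I got this form of tokenization from the readability Python page.)
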
text = ('This is sentence one .\n' 
'This is sentence two \n.')
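
On the question of whether Python treats \n as a new line, here is a minimal sketch using only the standard library. It assumes a slightly tidied version of the text above (each sentence ending in " .\n"), and the names lines, sentences and rebuilt are illustrative only. Adjacent string literals inside parentheses are joined into one string, and \n in a normal (non-raw) literal is a single newline character.

```python
# Adjacent string literals inside parentheses are joined into a single string.
text = ('This is sentence one .\n'
        'This is sentence two .\n')

# '\n' in a normal (non-raw) string literal is one newline character,
# so splitlines() yields one entry per sentence.
lines = text.splitlines()
print(lines)

# The same layout can be built from a list of sentences (illustrative data).
sentences = ['This is sentence one .', 'This is sentence two .']
rebuilt = ''.join(sentence + '\n' for sentence in sentences)
print(rebuilt == text)  # True
```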

Thanks

1 answer:

Answer 0 (score: 1)

Based on the information currently in the OP, you could try the following:

a = ['sentence 1', 'sentence 2', 'let me guess... a third sentence?']

s = str(a).replace('[', '').replace(']', '').replace(', ', '\n').replace(',', '\n')
print(s)

This outputs:

$ python p.py
'sentence 1'
'sentence 2'
'let me guess... a third sentence?'

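Note the use of replace(', ', '\n') followed by replace(',', '\n'): the first handles the ", " separator between list items, and the second catches any remaining commas. Be aware that these replacements also act on any brackets or commas that appear inside the sentences themselves.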
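
As a possible alternative that leaves the sentence contents untouched, one could quote each item with repr() and join the results with newlines. This is only a sketch, not part of the original answer: quote_lines is a hypothetical helper name, and the sample list is made up to include a comma and brackets. One caveat is that repr() switches to double quotes when a sentence contains an ASCII single quote.

```python
import ast

# Illustrative data: the third item contains a comma and brackets on purpose.
sentences = ['sentence 1', 'sentence 2', 'a third, [bracketed] sentence?']


def quote_lines(items):
    """Return the items as one string, each repr()-quoted on its own line."""
    return '\n'.join(repr(item) for item in items)


quoted = quote_lines(sentences)
print(quoted)

# Each line is a valid Python string literal, so the list can be recovered.
restored = [ast.literal_eval(line) for line in quoted.splitlines()]
print(restored == sentences)  # True
```

Because nothing is replaced inside the strings, punctuation within a sentence survives, and the original list can be rebuilt from the quoted text.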