Question

I am trying to tokenize a text:

from nltk.tokenize import sent_tokenize, word_tokenize 

text = '''The team used archive "data" from 2016...and 2017 
captured by the ESA/NASA Hubble Space Telescope and developed 
open-source algorithms to analyse the starlight filtered through 
K2-18b’s atmosphere. The results revealed the molecular 
signature of water vapour, also indicating the presence of 
hydrogen and helium in the planet’s atmosphere.'''

token = (sent_tokenize(text))
token

This gives me:

['The team used archive "data" from 2016...and 2017 captured by the ESA/NASA Hubble Space Telescope and developed open-source algorithms to analyse the starlight filtered through K2-18b’s atmosphere.',
 'The results revealed the molecular signature of water vapour, also indicating the presence of hydrogen and helium in the planet’s atmosphere.']

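How can I convert this into a string while keeping the '' quotes around each sentence?

Everything I have found joins the elements of the list together and throws away the tokenization.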

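Edit: essentially I want the output to look like the example below. Once parsed, will Python treat .\n as a new line? (Note that I got this form of tokenization from the readability Python page.)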

text = ('This is sentence one .\n' 
'This is sentence two \n.')

Thanks

Answer 1

Based on the information currently in the question, you could try the following:

a = ['sentence 1', 'sentence 2', 'let me guess... a third sentence?']

s = str(a).replace('[', '').replace(']', '').replace(', ', '\n').replace(',', '\n')
print(s)

This outputs:

$ python p.py
'sentence 1'
'sentence 2'
'let me guess... a third sentence?'

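Note the use of both replace(', ', '\n') and replace(',', '\n'). The first handles the comma-plus-space separator that str() puts between list items, and the second catches any commas that are left. Both also replace commas that occur inside a sentence, so this only works cleanly when the sentences themselves contain no commas.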
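If the sentences may contain commas or square brackets, building the string with str.join is probably safer than editing the printed form of the list. The snippet below is a minimal sketch of that idea, not the only way to do it. The list `sentences` is a made-up stand-in for the output of sent_tokenize(text), and the names `quoted` and `plain` are illustrative. It also touches on the .\n part of the question: in a normal (non-raw) string literal, \n is a single newline character, and adjacent string literals inside parentheses are concatenated into one string.

```python
# Illustrative stand-in for the list returned by sent_tokenize(text).
sentences = [
    'sentence 1',
    'sentence 2, with a comma',
    'let me guess... a third sentence?',
]

# Wrap each sentence in single quotes and put one sentence per line.
# No replace() is involved, so commas inside a sentence are left alone.
quoted = '\n'.join(f"'{s}'" for s in sentences)
print(quoted)

# The same idea without the quotes: one sentence per line, with a
# trailing newline. Each '\n' here is a real newline character.
plain = '\n'.join(sentences) + '\n'

# Splitting on line boundaries recovers the original list, so the
# sentence boundaries survive the round trip.
print(plain.splitlines() == sentences)
```

Using repr(s) instead of the f-string would also add quotes, but Python switches to double quotes when a sentence contains a straight apostrophe, so the explicit f-string keeps the output predictable.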
