Rejoining the original sentence after tokenizing with nltk word_tokenize

Date: 2019-07-02 16:08:37

Tags: python nltk tokenize

If I split a sentence with nltk.tokenize.word_tokenize() and then rejoin it with ' '.join(), the result will not be exactly the same as the original sentence, because words with punctuation inside them get split into separate tokens.

How can I rejoin the tokens programmatically so that the sentence comes out exactly as it was before?

For example:

from nltk import word_tokenize
sentence = "Story: I wish my dog's hair was fluffier, and he ate better"
print(sentence)
=> Story: I wish my dog's hair was fluffier, and he ate better
tokens = word_tokenize(sentence)
print(tokens)
=> ['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair', 'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']
sentence = ' '.join(tokens)
print(sentence)
=> Story : I wish my dog 's hair was fluffier , and he ate better

which is not the same as the original.

2 Answers:

Answer 0 (score: 1):

Per this answer, you can use the MosesDetokenizer as a solution.

Just remember to download the nltk subpackage first: nltk.download('perluniprops')

>>> import nltk
>>> sentence = "Story: I wish my dog's hair was fluffier, and he ate better"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair', 'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']
>>> from nltk.tokenize.moses import MosesDetokenizer
>>> detokens = MosesDetokenizer().detokenize(tokens, return_str=True)
>>> detokens
"Story: I wish my dog's hair was fluffier, and he ate better"

Answer 1 (score: 0):

You can use the replace function after joining:

 sentence.replace(" '", "'").replace(" : ", ": ").replace(" , ", ", ")
 # o/p
 Story: I wish my dog's hair was fluffier, and he ate better
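
A slightly more general version of the same idea is a small regex-based helper that strips the space ' '.join() leaves before punctuation and clitics. This is only a rough sketch (the name simple_detokenize is made up here, and it ignores quotes, brackets, and other cases word_tokenize can produce):

 import re

 def simple_detokenize(tokens):
     # join with spaces, then re-attach punctuation and clitics to the word on their left
     text = ' '.join(tokens)
     text = re.sub(r" ([,.:;!?%])", r"\1", text)
     text = re.sub(r" ('s|n't|'re|'ve|'ll|'d|'m)", r"\1", text)
     return text

 print(simple_detokenize(tokens))
 => Story: I wish my dog's hair was fluffier, and he ate better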