Python - How to assign an encoded one-hot vector to a string value

Asked: 2018-04-01 07:46:56

Tags: python one-hot-encoding

I'm not sure I'm asking this the right way. I have encoded the POS tags as follows:

from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
transfomed_label = encoder.fit_transform(["CC","CD","DT","EX","FW","IN","JJ","JJR","JJS","LS","MD","NN","NNS","NNP","NNPS","PDT","POS","PRP","PRP$","RB","RBR","RBS","RP","SYM","TO","UH","VB","VBD","VBG","VBN","VBP","VBZ","WDT","WP","WP$","WRB"])
# print(transfomed_label)
# Build the mapping between each label and its index in encoder.classes_
# print(encoder.classes_)
labels = encoder.classes_
mappings = {}
for index, label in enumerate(labels):
    mappings[label] = index
# print(mappings)


for item in transfomed_label:
    print (item)

Now I have a sentence, and I have extracted its POS tags:

import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
examplearray=['This is Timothy learning python']
for item in examplearray:
    tokenized=nltk.word_tokenize(item)
    tagged=nltk.pos_tag(tokenized)
    print(tagged)   

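This gives me [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'), ('learning', 'VBG'), ('python', 'NN')]

I want this sentence to be encoded as

[[000001000],[100000000],[010000000],[000000001],[000100000]]

*The vectors above are only illustrative.

Can anyone help me build an array of vectors corresponding to the input sentence?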

2 answers:

Answer 0 (score: 1)

If I understand you correctly, you want something like this:

res = [transfomed_label[mappings[tag]] for _, tag in tagged]
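
A minimal, self-contained sketch of that lookup in plain Python, so it runs without sklearn or nltk. The `tagged` list is hard-coded from the question's output, the tag set is shortened to six tags for readability, and `one_hot_rows` is a hypothetical stand-in for `transfomed_label`. One caveat, as far as I can tell: `transfomed_label` has one row per input label in input order, while `mappings` indexes the sorted `encoder.classes_`, so the lookup only lines up if the list passed to `fit_transform` was already sorted. Calling `encoder.transform([tag for _, tag in tagged])` should sidestep that.

```python
# Output of nltk.pos_tag for the example sentence, copied from the question.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'),
          ('learning', 'VBG'), ('python', 'NN')]

# Shortened tag set for illustration only; the question's list has 36 tags.
labels = sorted(['DT', 'NN', 'NNP', 'NNS', 'VBG', 'VBZ'])

# Same idea as `mappings` in the question: label -> column index.
mappings = {label: index for index, label in enumerate(labels)}

# Hypothetical stand-in for the encoder output: row i is the one-hot
# vector for labels[i], i.e. an identity matrix.
one_hot_rows = [[1 if col == row else 0 for col in range(len(labels))]
                for row in range(len(labels))]

# Look up one row per (word, tag) pair in the sentence.
res = [one_hot_rows[mappings[tag]] for _, tag in tagged]

for (word, tag), vector in zip(tagged, res):
    print(word, tag, vector)
```

Because the labels are sorted before the rows are built, row index and column index agree here by construction.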

Answer 1 (score: 0)

First, let's list all the POS tags used by the nltk package. (Note: this depends on the Penn Treebank tag set for the language you are working with.)

pos_tags_list = ['CC', 'CD', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']

Now build two mapping dictionaries:

tag_to_num = {tag: i for i, tag in enumerate(sorted(pos_tags_list))}
num_to_tag = {i: tag for i, tag in enumerate(sorted(pos_tags_list))}

Now you can extract all the tags from your sentence and encode them with sklearn's OneHotEncoder, pandas get_dummies, or Keras's to_categorical.
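
To make that last step concrete, here is a rough sketch that builds the vectors by hand from `tag_to_num` instead of calling a library. It assumes the 36-tag list from the question, since the shorter list in this answer has no `DT` or `SYM` and the example sentence needs `DT`. The `one_hot` helper is hypothetical and not part of nltk or sklearn, and `tagged` is hard-coded from the question's output.

```python
# 36 Penn Treebank tags, copied from the question.
pos_tags_list = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS",
                 "LS", "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS",
                 "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO",
                 "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT",
                 "WP", "WP$", "WRB"]

# Forward and reverse mappings between tags and column indices.
tag_to_num = {tag: i for i, tag in enumerate(sorted(pos_tags_list))}
num_to_tag = {i: tag for tag, i in tag_to_num.items()}


def one_hot(tag):
    """Hypothetical helper: all zeros except a 1 at the tag's index."""
    vector = [0] * len(tag_to_num)
    vector[tag_to_num[tag]] = 1
    return vector


# Output of nltk.pos_tag for the example sentence, copied from the question.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'),
          ('learning', 'VBG'), ('python', 'NN')]

# One 36-element vector per word in the sentence.
encoded = [one_hot(tag) for _, tag in tagged]

# Decode back to tags to check that the round trip is lossless.
decoded = [num_to_tag[vector.index(1)] for vector in encoded]
print(decoded)
```

A tag that is missing from `pos_tags_list` raises a `KeyError` in `one_hot`, which is probably what you want while you are still settling on the tag set.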