Question

I'm not sure I'm asking this the right way. I encoded the POS tags as follows:

from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
transfomed_label = encoder.fit_transform(["CC","CD","DT","EX","FW","IN","JJ","JJR","JJS","LS","MD","NN","NNS","NNP","NNPS","PDT","POS","PRP","PRP$","RB","RBR","RBS","RP","SYM","TO","UH","VB","VBD","VBG","VBN","VBP","VBZ","WDT","WP","WP$","WRB"])
#print(transfomed_label)
# Build the mapping from each label to its index in encoder.classes_
#print(encoder.classes_)
labels = encoder.classes_
mappings = {}
for index, label in enumerate(labels):
    mappings[label] = index
#print(mappings)


for item in transfomed_label:
    print (item)

Now I have a sentence, and I have extracted its POS tags:

import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
examplearray=['This is Timothy learning python']
for item in examplearray:
    tokenized=nltk.word_tokenize(item)
    tagged=nltk.pos_tag(tokenized)
    print(tagged)

This gives me [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'), ('learning', 'VBG'), ('python', 'NN')]

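I want this sentence to be encoded as

[[000001000],[100000000],[010000000],[000000001],[000100000]]

*The vectors above are only illustrative.

Can anyone help me build an array of vectors corresponding to the input sentence?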

Answer 1

If I understand you correctly, you want something like this:

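res = [transfomed_label[mappings[tag]] for _, tag in tagged]

Note that this indexes the rows of transfomed_label by each tag's position in encoder.classes_, which is sorted. It is therefore only correct if the list passed to fit_transform was already in sorted order; in the question's list, "NNS" comes before "NNP" and "NNPS", so those three tags would get each other's vectors.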
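A possible alternative, sketched here under the assumption that the fitted LabelBinarizer from the question is still available, is to call its transform method on the sentence's tags directly. The lookup then goes through encoder.classes_ and does not depend on the order of the list used for fitting. The tagged list is copied from the question's output rather than recomputed with nltk, and the name pos_tags is introduced only for this illustration.

```python
from sklearn.preprocessing import LabelBinarizer

# Tag set from the question, in the question's original (unsorted) order.
pos_tags = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS",
            "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$",
            "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG",
            "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB"]

encoder = LabelBinarizer()
encoder.fit(pos_tags)

# Tagged sentence copied from the question's nltk.pos_tag output.
tagged = [("This", "DT"), ("is", "VBZ"), ("Timothy", "NNP"),
          ("learning", "VBG"), ("python", "NN")]

# transform() matches each tag against encoder.classes_, so every row is
# the one-hot vector of the corresponding tag.
res = encoder.transform([tag for _, tag in tagged])

# Decode each row back to a tag to check the encoding.
for (word, tag), row in zip(tagged, res):
    print(word, tag, encoder.classes_[row.argmax()])
```

Here res is a NumPy array with one row per token and one column per tag in encoder.classes_.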

Answer 2

First, let's get all the POS tags in the nltk package. (Note: this depends on the Penn Treebank tag set for the language you are using.)

pos_tags_list = ['CC', 'CD', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS','NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']

Now build two mapping dictionaries:

tag_to_num = {tag: i for i, tag in enumerate(sorted(pos_tags_list))}
num_to_tag = {i: tag for i, tag in enumerate(sorted(pos_tags_list))}

Now you can extract all the tags from the sentence and encode them with sklearn's one-hot encoder, pandas dummies, or the Keras to_categorical method.
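As a rough illustration of that last step without any extra library, the two dictionaries can be used to build the one-hot vectors by hand. This is only a sketch: the one_hot helper is hypothetical, the tagged list is copied from the question's output, and the tag list is taken from the question because the list in this answer omits 'DT' and 'SYM', and the example sentence needs 'DT'.

```python
# Tag set from the question (includes 'DT' and 'SYM').
pos_tags_list = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS",
                 "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$",
                 "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD",
                 "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB"]

# Two-way mapping between tags and integer indices.
tag_to_num = {tag: i for i, tag in enumerate(sorted(pos_tags_list))}
num_to_tag = {i: tag for tag, i in tag_to_num.items()}


def one_hot(tag):
    """Return a list of zeros with a single 1 at the tag's index."""
    vector = [0] * len(tag_to_num)
    vector[tag_to_num[tag]] = 1
    return vector


# Tagged sentence copied from the question's nltk.pos_tag output.
tagged = [("This", "DT"), ("is", "VBZ"), ("Timothy", "NNP"),
          ("learning", "VBG"), ("python", "NN")]

# One vector per token, in sentence order.
encoded = [one_hot(tag) for _, tag in tagged]

# num_to_tag recovers the tag from the position of the 1.
decoded = [num_to_tag[vector.index(1)] for vector in encoded]
print(decoded)
```

The library routes mentioned above do the same thing more compactly; the manual version mainly makes the role of tag_to_num and num_to_tag explicit.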

Python - How to assign an encoded one-hot vector to a string value
