I'm not sure whether I'm asking this the right way. I encoded the POS tags as follows:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
transfomed_label = encoder.fit_transform(["CC","CD","DT","EX","FW","IN","JJ","JJR","JJS","LS","MD","NN","NNS","NNP","NNPS","PDT","POS","PRP","PRP$","RB","RBR","RBS","RP","SYM","TO","UH","VB","VBD","VBG","VBN","VBP","VBZ","WDT","WP","WP$","WRB"])
#print(transfomed_label)
# START: build the mapping between each label and its index
#print(encoder.classes_)
labels = encoder.classes_
mappings = {}
for index, label in enumerate(labels):
    mappings[label] = index
#print(mappings)
# END: build the mapping between each label and its index
for item in transfomed_label:
    print(item)
Now I have a sentence, and I have extracted its POS tags:
import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
examplearray = ['This is Timothy learning python']
for item in examplearray:
    tokenized = nltk.word_tokenize(item)
    tagged = nltk.pos_tag(tokenized)
    print(tagged)
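This gives me [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'), ('learning', 'VBG'), ('python', 'NN')]
I want this sentence to be encoded as
[[000001000],[100000000],[010000000],[000000001],[000100000]]
*The vectors above are only illustrative.
Can anyone help me build an array of vectors corresponding to the input sentence?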
Answer 0 (score: 1)
If I understand you correctly, you want something like this:
res = [transfomed_label[mappings[tag]] for _, tag in tagged]
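As a minimal sketch of an alternative, assuming the encoder is fitted on the same tag list as in the question, `LabelBinarizer.transform` can encode the sentence's tags directly, without going through the `mappings` lookup. The names `pos_tags`, `sentence_tags`, and `sentence_vectors` are illustrative, and `tagged` is hard-coded from the output shown in the question so the snippet does not depend on the nltk downloads.

```python
from sklearn.preprocessing import LabelBinarizer

# Tag list copied from the question.
pos_tags = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
            "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
            "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
            "VBZ", "WDT", "WP", "WP$", "WRB"]

encoder = LabelBinarizer()
encoder.fit(pos_tags)

# Output of nltk.pos_tag for the example sentence, as reported in the question.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'),
          ('learning', 'VBG'), ('python', 'NN')]

# Keep only the tag of each (word, tag) pair, then one-hot encode them.
sentence_tags = [tag for _, tag in tagged]
sentence_vectors = encoder.transform(sentence_tags)

print(sentence_vectors.shape)  # (5, 36): one row per word, one column per tag
for (word, tag), vector in zip(tagged, sentence_vectors):
    # argmax gives the column that is set to 1 for this word's tag.
    print(word, tag, vector.argmax())
```

A tag that was not in the list passed to `fit` is encoded as an all-zero row, so the fitted tag list should cover every tag the tagger can emit.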
Answer 1 (score: 0)
First, let's get all the POS tags used by the nltk package. (Note: this depends on the Penn Treebank tag set for the language you are working with.)
pos_tag_list = ['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']
Now build two mapping dictionaries:
tag_to_num = {tag: i for i, tag in enumerate(sorted(pos_tag_list))}
num_to_tag = {i: tag for i, tag in enumerate(sorted(pos_tag_list))}
Now you can extract all the tags from your sentence and encode them with one of the following: sklearn's OneHotEncoder, pandas get_dummies, or keras to_categorical.
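To make that last step concrete, here is a minimal sketch of the kind of rows those encoders produce, written in plain Python so it has no library dependency. It assumes a `tag_to_num` dictionary built as in this answer; the shortened tag list, the `one_hot` helper, and the hard-coded `tagged` list (taken from the output in the question) are illustrative and not part of the original answer.

```python
# Shortened tag list for illustration only; use the full tag set in practice.
pos_tag_list = ['NN', 'DT', 'VBZ', 'NNP', 'VBG']

# Map each tag to a stable integer index (alphabetical order).
tag_to_num = {tag: i for i, tag in enumerate(sorted(pos_tag_list))}


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Output of nltk.pos_tag for the example sentence, as reported in the question.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'),
          ('learning', 'VBG'), ('python', 'NN')]

# One one-hot row per word, in sentence order.
encoded = [one_hot(tag_to_num[tag], len(tag_to_num)) for _, tag in tagged]

for (word, tag), vector in zip(tagged, encoded):
    print(word, tag, vector)
```

Each row has exactly one 1, in the column given by `tag_to_num`, so `num_to_tag` can be used to map a row's index back to its tag.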