How to invert a one-hot encoded output vector from an AutoEncoder model?

Asked: 2018-05-22 12:40:38

Tags: python machine-learning keras autoencoder one-hot-encoding

I have a list of tokens as input. I used one-hot encoding to convert the list of text tokens into a binary-encoded matrix. That matrix is then fed into a simple autoencoder architecture. The architecture consists of 2 fully connected layers and is adapted from the first part of this link.

To compare/understand the results of this architecture, I need to invert the one-hot encoding transformation. This step is blocked by the following error:

ValueError: y contains new labels: [121]
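This error comes from LabelEncoder.inverse_transform being asked for an index it never saw during fit: argmax over a 166-dimensional reconstruction can return an index larger than the number of classes the encoder was fitted on. A minimal sketch reproducing the situation (the token strings are made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# encoder fitted on only 3 tokens -> valid indices are 0, 1, 2
le = LabelEncoder()
le.fit(['<a>', '<b>', 'text'])

# asking for an index outside the fitted classes raises the error
try:
    le.inverse_transform([121])
except ValueError as e:
    print('ValueError:', e)
```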

import os
import re
import xml.dom.minidom
import numpy as np
from numpy import array, argmax
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def getTokens(xml_string):
    firstTagMatches = re.findall(r'(\<\w+\>)', xml_string, re.DOTALL)
    closedTagMatches = re.findall(r'(\<\/\w+\>)', xml_string, re.DOTALL)
    betweenTagMatches = re.findall(r'>(.*?)<', xml_string)
    xmlTokens = firstTagMatches + betweenTagMatches + closedTagMatches
    return xmlTokens

def oneHotEncoding(data):
    values = array(data)
    # integer encode
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    # binary encode (scikit-learn >= 1.2 renames sparse to sparse_output)
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    return onehot_encoded

def invertOneHotEncoding(data, decoded_imgs):
    values = array(data)
    print('values', values)
    print('type', type(values))
    # integer encode
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    print('integer_encoded', integer_encoded)
    print('decoded images 1', decoded_imgs)
    # invert
    inverted = [label_encoder.inverse_transform([argmax(decoded_imgs[i, :])])
                for i in range(len(decoded_imgs))]
    return inverted

def getEncodedTrainingData(directoryPath):
    # we know the first file gives a numpy array of shape (166, 166)
    encodedXML = np.zeros(shape=(0, 166))
    for filename in os.listdir(directoryPath):
        print('filename', filename)
        xmlObject = xml.dom.minidom.parse(directoryPath + '/' + filename)
        pretty_xml_as_string = xmlObject.toprettyxml()
        trainingTokens = getTokens(pretty_xml_as_string)
        transformedMatrice = oneHotEncoding(trainingTokens)
        dimensions = transformedMatrice.shape[0] * transformedMatrice.shape[1]
        # it may lose some information!! needs a better solution
        onehotEncodedArray = np.resize(transformedMatrice, (int(dimensions / 166), 166))
        print('onehotEncodedArray shape', onehotEncodedArray.shape)
        encodedXML = np.concatenate((encodedXML, onehotEncodedArray), axis=0)
    print('before deleting shape', encodedXML.shape)
    return encodedXML

encodedXML = getEncodedTrainingData('./trainingData/')
encodedXML_test = getEncodedTrainingData('./testingData/')
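As the comment in getEncodedTrainingData warns, np.resize does not preserve row structure: it flattens the array, then repeats or silently truncates the data to fill the requested shape. A small sketch of that behavior:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)       # 6 elements in shape (2, 3)
print(np.resize(m, (4, 3)))          # 12 slots: the data is repeated to fill
print(np.resize(m, (1, 3)))          # 3 slots: the data is silently truncated
```

So reshaping the per-file one-hot matrix this way can duplicate or drop token rows, which distorts what the autoencoder trains on.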

The machine-learning part:

from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

# build the AutoEncoder model
# this is the size of our encoded representations
encoding_dim = 32
# this is our input placeholder
input_vector = Input(shape=(166,))
# "encoded" is the encoded representation of the input
encoded = Dense(encoding_dim, activation='relu',
                activity_regularizer=regularizers.l1(10e-5))(input_vector)
# "decoded" is the lossy reconstruction of the input
decoded = Dense(166, activation='relu')(encoded)
# this model maps an input to its reconstruction
autoencoder = Model(input_vector, decoded)
encoder = Model(input_vector, encoded)
encoded_input = Input(shape=(encoding_dim,))
# retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]
# create the decoder model
decoder = Model(encoded_input, decoder_layer(encoded_input))
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy',
                    metrics=['binary_accuracy', 'categorical_accuracy'])
x_train = encodedXML
x_test = encodedXML_test
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=False,
                validation_data=(x_test, x_test))
# encode and decode some vectors
# note that we take them from the *test* set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder.predict(encoded_imgs)
invertedSentence = invertOneHotEncoding(testDataTokens, decoded_imgs)

2 Answers:

Answer 0 (score: 0)

Your input and output are mixed up. Let's simplify: suppose there are 4 labels, then your x_train = [[0,1,0,0], [0,0,0,1]], so you have 2 data points that are already one-hot encoded. Now you can build the autoencoder as:

inp = Input(shape=(num_classes,))  # 'in' is a reserved word in Python
enc = Dense(encoding_dim, activation='relu')(inp)
out = Dense(num_classes, activation='softmax')(enc)

Your network then has to predict one of the classes as a distribution, given the encoded vector. Since this is still a classification problem, you would train with the categorical_crossentropy loss.
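The point of the softmax output is that each reconstructed row becomes a probability distribution over the classes, so argmax directly recovers a class index. A numpy-only sketch of that idea (the logits are made-up values):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical decoder output (logits) for num_classes = 4
logits = np.array([0.1, 2.5, 0.3, -1.0])
probs = softmax(logits)
print(round(float(probs.sum()), 6))  # 1.0 -- a valid distribution
print(int(np.argmax(probs)))         # 1   -- the recovered class index
```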

Answer 1 (score: 0)

Actually the code works fine. The problem was that when I trained the autoencoder on my real data, there wasn't enough of it. So when I tested the autoencoder, the predicted vector was not exactly the test vector, which is why I got the ValueError exception. To obtain the final decoded output, I modified 'invertOneHotEncoding' to:

def invertOneHotEncoding(label_encoder, decoded_imgs):
    inverted = []
    for i in range(len(decoded_imgs)):
        try:
            print('decoded', label_encoder.inverse_transform([argmax(decoded_imgs[i, :])]))
            inverted.append(label_encoder.inverse_transform([argmax(decoded_imgs[i, :])]))
        except ValueError:
            inverted.append('<>')
    return inverted
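The same guard pattern can be exercised on a toy encoder, assuming hypothetical token strings: indices inside the fitted classes decode normally, and any out-of-range index falls back to the '<>' placeholder instead of raising:

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical encoder fitted on 3 tokens -> valid indices are 0..2
label_encoder = LabelEncoder()
label_encoder.fit(['<a>', '<b>', 'text'])

def safe_inverse(label_encoder, indices):
    inverted = []
    for idx in indices:
        try:
            inverted.append(label_encoder.inverse_transform([idx])[0])
        except ValueError:          # index outside the fitted classes
            inverted.append('<>')
    return inverted

print(safe_inverse(label_encoder, [0, 2, 121]))  # ['<a>', 'text', '<>']
```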

So it now makes sense to me why I got that error.