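I have fixed-length text sequences (sentences, if you like); I have padded them so that their length is constant. I have a fixed set of these sentences that I want to represent as vectors. My plan is to use an autoencoder. Is there a way to make the autoencoder reproduce its input with 100% accuracy? All I need is dimensionality reduction, so that I can easily cluster these sentences. The number of sequences is finite, so I don't see why the autoencoder can't get close to 0 loss.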
#Goal is to build an autoencoder that memorizes the following combinations
#AAABBB
#AABBCC
#ABCABC
#BACBCA
#BCBCAB
#Note this is toy data. The actual data is much bigger. I have 8000 samples and each sample is a string of around 10 words. There are a total of 77 unique words.
#One-hot encoding should look like
#A ==> [1., 0, 0, 0]
#B ==> [0, 1., 0, 0]
#C ==> [0, 0, 1., 0]
#D ==> [0, 0, 0, 1.]
#Representing the 5 sequences when one-hot encoded looks like below
import numpy as np
data = [
[
[1., 0, 0, 0],
[1., 0, 0, 0],
[1., 0, 0, 0],
[0, 1., 0, 0],
[0, 1., 0, 0],
[0, 1., 0, 0]
],
[
[1., 0, 0, 0],
[1., 0, 0, 0],
[0, 1., 0, 0],
[0, 1., 0, 0],
[0, 0, 1., 0],
[0, 0, 1., 0]
],
[
[1., 0, 0, 0],
[0, 1., 0, 0],
[0, 0, 1., 0],
[1., 0, 0, 0],
[0, 1., 0, 0],
[0, 0, 1., 0]
],
[
[0, 1., 0, 0],
[1., 0, 0, 0],
[0, 0, 1., 0],
[0, 1., 0, 0],
[0, 0, 1., 0],
[1., 0, 0, 0]
],
[
[0, 1., 0, 0],
[0, 0, 1., 0],
[0, 1., 0, 0],
[0, 0, 1., 0],
[1., 0, 0, 0],
[0, 1., 0, 0]
]
]
train = np.asarray(data)
NUMBER_OF_SAMPLES = train.shape[0]
NUMBER_OF_STEPS = train.shape[1]
ONEHOTENCODING_SIZE = train.shape[2]
HIDDEN_DIMENSION = 100
from keras.models import Sequential, Model
from keras.layers import Dense, Input
from keras.layers import TimeDistributed, RepeatVector
from keras.layers import LSTM
from keras.optimizers import Adam
model = Sequential()
model.add(LSTM(HIDDEN_DIMENSION, activation = 'relu', input_shape=(NUMBER_OF_STEPS, ONEHOTENCODING_SIZE), return_sequences=True))
model.add(LSTM(2, activation = 'relu', return_sequences=False))
model.add(RepeatVector(NUMBER_OF_STEPS))
model.add(LSTM(2, activation = 'relu', return_sequences=True))
model.add(LSTM(HIDDEN_DIMENSION, activation = 'relu', return_sequences=True))
model.add(TimeDistributed(Dense(ONEHOTENCODING_SIZE)))
adamOptimizer = Adam(learning_rate=0.01)  # this argument was named `lr` in older Keras releases
model.compile(loss='mean_squared_error', optimizer=adamOptimizer)
model.summary()  # summary() prints the table itself; wrapping it in print() only adds a stray "None"
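Note that the data in the code is only a tiny example. My real samples have a one-hot encoding dimension of about 100, and there are about 8000 samples. There are roughly 200 unique sequences, so I was hoping I could assign one of these 200 unique vectors to each of my samples and have a way to cluster them perfectly. Alas, in the best case I am left with about 5% of samples that are not encoded correctly (that is, the autoencoder's prediction does not exactly match the autoencoder's input).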
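
To make "not encoded correctly" concrete, here is a minimal sketch of one way the exact-match rate could be measured. It assumes the prediction has the same shape as the one-hot input, (samples, steps, one-hot size), and that a time step counts as correct when the argmax of the prediction equals the argmax of the input. The function name `exact_match_rate` and the small arrays below are hypothetical and are not part of my real code or data.

```python
import numpy as np

def exact_match_rate(inputs, predictions):
    """Fraction of sequences in which every time step is reconstructed correctly."""
    true_ids = np.argmax(inputs, axis=-1)       # shape: (samples, steps)
    pred_ids = np.argmax(predictions, axis=-1)  # shape: (samples, steps)
    # A sequence only counts as correct if all of its steps match.
    return float(np.mean(np.all(true_ids == pred_ids, axis=1)))

# Hypothetical toy check: two sequences of three steps over a 4-symbol vocabulary.
symbol_ids = np.array([[0, 0, 1],
                       [1, 2, 0]])
inputs = np.eye(4)[symbol_ids]                  # one-hot, shape (2, 3, 4)

# A soft but still correct reconstruction: the largest value sits in the right slot.
soft = inputs * 0.8 + 0.05

# A reconstruction that gets the last step of the second sequence wrong.
wrong = inputs.copy()
wrong[1, 2] = [0.0, 0.0, 0.0, 1.0]

rate_soft = exact_match_rate(inputs, soft)
rate_wrong = exact_match_rate(inputs, wrong)
print(rate_soft, rate_wrong)
```

With the model above, the same function could presumably be called as `exact_match_rate(train, model.predict(train))` once the model has been fitted.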