Question

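I have text sequences (sentences, if you will) of fixed length (I have padded them so they are constant). I have a fixed set of these sentences, and I want to represent them as vectors. My plan is to use an autoencoder. Is there a way to make the autoencoder reconstruct its input with 100% accuracy? All I need is dimensionality reduction, so that I can easily cluster these sentences. The number of sequences is finite, so I don't see why the autoencoder can't get close to zero loss.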

#Goal is to build an autoencoder that memorizes the following combinations
#AAABBB
#AABBCC
#ABCABC
#BACBCA
#BCBCAB
#Note this is toy data. The actual data is much bigger. I have 8000 samples and each sample is a string of around 10 words. There are a total of 77 unique words.

#One-hot encoding should look like
#A ==> [1., 0, 0, 0]
#B ==> [0, 1., 0, 0]
#C ==> [0, 0, 1., 0]
#D ==> [0, 0, 0, 1.]

#Representing the 5 sequences when one-hot encoded looks like below
import numpy as np

data = [
          [
              [1., 0, 0, 0],
              [1., 0, 0, 0],
              [1., 0, 0, 0],
              [0, 1., 0, 0],
              [0, 1., 0, 0],
              [0, 1., 0, 0]
          ],
          [
              [1., 0, 0, 0],
              [1., 0, 0, 0],
              [0, 1., 0, 0],
              [0, 1., 0, 0],
              [0, 0, 1., 0],
              [0, 0, 1., 0]
          ],
          [
              [1., 0, 0, 0],
              [0, 1., 0, 0],
              [0, 0, 1., 0],
              [1., 0, 0, 0],
              [0, 1., 0, 0],
              [0, 0, 1., 0]
          ],
          [
              [0, 1., 0, 0],
              [1., 0, 0, 0],
              [0, 0, 1., 0],
              [0, 1., 0, 0],
              [0, 0, 1., 0],
              [1., 0, 0, 0]
          ],
          [
              [0, 1., 0, 0],
              [0, 0, 1., 0],
              [0, 1., 0, 0],
              [0, 0, 1., 0],
              [1., 0, 0, 0],
              [0, 1., 0, 0]
          ]
]

train = np.asarray(data)

NUMBER_OF_SAMPLES = train.shape[0]
NUMBER_OF_STEPS = train.shape[1]
ONEHOTENCODING_SIZE = train.shape[2]

HIDDEN_DIMENSION = 100

from keras.models import Sequential, Model
from keras.layers import Dense, Input
from keras.layers import TimeDistributed, RepeatVector
from keras.layers import LSTM
from keras.optimizers import Adam

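model = Sequential()
# Encoder: read the one-hot sequence and compress it to a 2-dimensional vector
model.add(LSTM(HIDDEN_DIMENSION, activation='relu', input_shape=(NUMBER_OF_STEPS, ONEHOTENCODING_SIZE), return_sequences=True))
model.add(LSTM(2, activation='relu', return_sequences=False))
# Decoder: repeat the 2-dimensional code once per time step and expand it back
model.add(RepeatVector(NUMBER_OF_STEPS))
model.add(LSTM(2, activation='relu', return_sequences=True))
model.add(LSTM(HIDDEN_DIMENSION, activation='relu', return_sequences=True))
# One linear output of size ONEHOTENCODING_SIZE per time step
model.add(TimeDistributed(Dense(ONEHOTENCODING_SIZE)))
# Recent Keras versions name this argument learning_rate rather than lr
adamOptimizer = Adam(learning_rate=0.01)
model.compile(loss='mean_squared_error', optimizer=adamOptimizer)
# summary() prints the table itself and returns None
model.summary()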
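
Since the goal is an exact match between input and prediction, it may help to measure that directly instead of relying on the loss value. The sketch below is not part of the original code. The helper names `decode_steps` and `exact_match_rate` are made up for illustration, and it assumes that taking the largest value at each time step is an acceptable way to turn a prediction back into a token.

```python
def decode_steps(sequence):
    """Return the index of the largest value at each time step."""
    return [max(range(len(step)), key=lambda i: step[i]) for step in sequence]


def exact_match_rate(inputs, predictions):
    """Fraction of samples whose decoded prediction equals the decoded
    input at every time step (hypothetical helper, not a Keras API)."""
    matches = sum(
        decode_steps(x) == decode_steps(p)
        for x, p in zip(inputs, predictions)
    )
    return matches / len(inputs)


# Tiny hand-made example: two samples, two steps, three symbols.
inputs = [
    [[1, 0, 0], [0, 1, 0]],
    [[0, 0, 1], [1, 0, 0]],
]
predictions = [
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],  # decodes to the same tokens
    [[0.1, 0.2, 0.7], [0.3, 0.6, 0.1]],  # second step decodes differently
]
print(exact_match_rate(inputs, predictions))  # 0.5
```

After training, something like `exact_match_rate(train, model.predict(train))` should give the share of perfectly reconstructed samples, since the helpers only index and iterate and so work on NumPy arrays as well as nested lists.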

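Note that the data in the code is only a tiny example. My real samples have a one-hot encoding dimension of about 100, and there are about 8000 samples. There are roughly 200 unique sequences, so I was hoping I could assign one of those 200 unique vectors to each of my samples and have a way to cluster them perfectly. Alas, in the best case I am left with about 5% of samples that are not encoded correctly (i.e., the autoencoder input does not exactly match the autoencoder prediction).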
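
To make the "about 200 unique sequences" figure concrete, the distinct sequences can be counted and numbered directly. This is only a sanity-check sketch, not a replacement for the autoencoder. The function name `assign_sequence_ids` and the sample list are made up for illustration, and the real data is assumed to be a list of strings or of word lists.

```python
def assign_sequence_ids(sequences):
    """Map each distinct sequence to a small integer ID, in order of
    first appearance (hypothetical helper for illustration)."""
    ids = {}
    labels = []
    for seq in sequences:
        key = tuple(seq)  # works for strings and for lists of words
        if key not in ids:
            ids[key] = len(ids)
        labels.append(ids[key])
    return labels, ids


# Toy samples in the spirit of the question, with one repeated sequence.
samples = ["AAABBB", "AABBCC", "ABCABC", "AAABBB", "BACBCA"]
labels, ids = assign_sequence_ids(samples)
print(labels)    # [0, 1, 2, 0, 3]
print(len(ids))  # 4
```

The number of distinct keys is the number of inputs the autoencoder actually has to tell apart, and `labels` gives a reference grouping against which the learned 2-dimensional codes could be compared.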

Perfect autoencoder for reducing the dimensionality of sequences: is it possible?

0 answers: