I am trying to create an embedding for two distinct input sequences: for each observation, take an integer symbol sequence and a time-series vector and produce a single embedding vector. The standard approach for a single input seems to be an autoencoder: feed the data in as both the input and the output, then extract the output of a hidden layer as the embedding.
I am using Keras, and it seems I am almost there. Input 1 has shape (1000000, 50) (one million integer lists of length 50). Input 2 has shape (1000000, 50, 1).
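(The code below uses max_seq_length, embedding_length, num_unique_event_symbols, rms_prop, X1, and X2 without defining them; a setup sketch along the following lines would make it self-contained. The values are placeholder assumptions, with padding conventions mirroring the masking settings used later: 0 for symbols, 99999999.0 for times.)

import numpy as np
import keras
from keras.layers import (Input, Embedding, LSTM, Masking, Dense, Lambda,
                          Flatten, TimeDistributed, Reshape)
from keras.models import Model

max_seq_length = 50                    # per the shapes above
embedding_length = 10                  # placeholder embedding size for the symbols
num_unique_event_symbols = 100         # placeholder vocabulary size
rms_prop = keras.optimizers.RMSprop()  # assumed: RMSprop with default settings

# Dummy data in the stated shapes; shrink the first dimension for a quick test.
X1 = np.random.randint(1, num_unique_event_symbols, size=(1000000, max_seq_length))
X2 = np.random.rand(1000000, max_seq_length, 1).astype('float32')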
Here is my Keras code.
##########################################
# Input 1: event type sequences
input_1a = Input(shape=(max_seq_length,), dtype='int32', name='first_input')
# Input 1: Embedding layer
input_1b = Embedding(output_dim=embedding_length, input_dim=num_unique_event_symbols,
                     input_length=max_seq_length, mask_zero=True)(input_1a)
# Input 1: LSTM
input_1c = LSTM(10, return_sequences=True)(input_1b)
##########################################
# Input 2: unix time (minutes) vectors
input_2a = Input(shape=(max_seq_length, 1), dtype='float32', name='second_input')
# Input 2: Masking
input_2b = Masking(mask_value=99999999.0)(input_2a)
# Input 2: LSTM
input_2c = LSTM(10, return_sequences=True)(input_2b)
##########################################
# Concatenation layer here
x = keras.layers.concatenate([input_1c, input_2c])
x2 = Dense(40, activation='relu')(x)
x3 = Dense(20, activation='relu', name='journey_embeddings')(x2)
##########################################
# Re-create the inputs
xl = Lambda(lambda x: x, output_shape=lambda s: s)(x3)
xf = Flatten()(xl)
xf1 = Dense(20, activation='relu')(xf)
xf2 = Dense(50, activation='relu')(xf1)
xd = Dense(20, activation='relu')(x3)
xd2 = TimeDistributed(Dense(1, activation='linear'))(xd)
##########################################
## Compile and fit the model
model = Model(inputs=[input_1a, input_2a], outputs=[xf2, xd2])
model.compile(optimizer=rms_prop, loss='mse')
print(model.summary())
np.random.seed(21)
model.fit([X1, X2], [X1, X2], epochs=1, batch_size=200)
Once this has run, I extract the "journey_embeddings" hidden layer output as follows:
layer_name = 'journey_embeddings'
intermediate_layer_model = Model(inputs=model.input, outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict([X1,X2])
However, the shape of intermediate_output is (1000000, 50, 20). I want an embedding vector of length 20 per observation. How do I get a shape of (1000000, 20)?
Answer 0 (score: 1)
You use return_sequences=True in your LSTMs, so they return full time series again rather than encoding each sequence into a single vector of size 20. This gives shape (.., 50, 20) because the LSTM outputs its hidden state at every time step. Presumably you want to encode all 50 time steps into one vector, in which case you should not return sequences.
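To illustrate, a standalone sketch of the shape difference (toy layers, not the full model above):

from keras.layers import Input, LSTM
from keras.models import Model

inp = Input(shape=(50, 1))
seq = LSTM(10, return_sequences=True)(inp)   # one hidden state per time step
vec = LSTM(10, return_sequences=False)(inp)  # final hidden state only

print(Model(inp, seq).output_shape)  # (None, 50, 10)
print(Model(inp, vec).output_shape)  # (None, 10)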
Answer 1 (score: 0)
Thanks to @nuric, the following code works:
##########################################
# Input 1: event type sequences
input_1a = Input(shape=(max_seq_length,), dtype='int32', name='first_input')
# Input 1: Embedding layer
input_1b = Embedding(output_dim=embedding_length, input_dim=num_unique_event_symbols,
                     input_length=max_seq_length, mask_zero=True)(input_1a)
# Input 1: LSTM
input_1c = LSTM(10, return_sequences=False)(input_1b)
##########################################
# Input 2: unix time (minutes) vectors
input_2a = Input(shape=(max_seq_length, 1), dtype='float32', name='second_input')
# Input 2: Masking
input_2b = Masking(mask_value=99999999.0)(input_2a)
# Input 2: LSTM
input_2c = LSTM(10, return_sequences=False)(input_2b)
##########################################
# Concatenation layer here
x = keras.layers.concatenate([input_1c, input_2c])
x2 = Dense(40, activation='relu')(x)
x3 = Dense(20, activation='relu', name='journey_embeddings')(x2)
##########################################
# An arbitrary number of dense, hidden layers here
xf1 = Dense(20, activation='relu')(x3)
xf2 = Dense(50, activation='relu')(xf1)
xd = Dense(50, activation='relu')(x3)
xd2 = Reshape((50, 1))(xd)
##########################################
## Compile and fit the model
model = Model(inputs=[input_1a, input_2a], outputs=[xf2, xd2])
model.compile(optimizer=rms_prop, loss='mse')
print(model.summary())
np.random.seed(21)
model.fit([X1, X2], [X1, X2], epochs=1, batch_size=200)
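With return_sequences=False, the bottleneck is now a single vector per observation, so the extraction code from the question should yield the desired shape:

layer_name = 'journey_embeddings'
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict([X1, X2])
print(intermediate_output.shape)  # (1000000, 20)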