Combining text with categorical features using entity embeddings and the Keras functional API

Time: 2018-10-23 17:23:24

Tags: python neural-network keras nlp embedding

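I am trying to combine text and categorical variables using the Keras functional API.

The model compiles, but when I try to train it with "fit" it raises an error. It looks like I am feeding in the data the wrong way. Has anyone done something similar and knows how to make it work?

Code used to build the model: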

all_inputs = []
cat_embeddings = []

text_input = Input(shape=(MAX_SEQUENCE_LENGTH, ),name='text_input')
all_inputs.append(text_input)

x = Embedding(embedding_matrix.shape[0], # or len(word_index) + 1
              embedding_matrix.shape[1], # or EMBEDDING_DIM,
              weights=[embedding_matrix],
              input_length=MAX_SEQUENCE_LENGTH,
              trainable=True)(text_input)

x = SpatialDropout1D(0.2)(x)
x = Bidirectional(GRU(128, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)
x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)

for cat in Categorical_Features:
    cat_input = Input(shape=(1,), name=cat)
    no_of_unique_cat  = X_train[cat].nunique()
    embedding_size = np.ceil((no_of_unique_cat)/2)
    embedding_size = int(embedding_size)
    cat_embedding = Embedding(no_of_unique_cat+1, embedding_size, input_length = 1)(cat_input)
    cat_embedding = Reshape(target_shape=(embedding_size,))(cat_embedding)

    all_inputs.append(cat_input)
    cat_embeddings.append(cat_embedding)

conc = Concatenate()(cat_embeddings)
x = concatenate([conc, x])

x = Dense(128, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(128, activation='relu')(x)

preds = Dense(93, activation="sigmoid")(x)

model = Model(inputs=all_inputs, outputs=preds)
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-3), metrics=['accuracy'])

Then I try to feed in the data like this:

model.fit([X_train_seq,
           X_train['Categorical_Features_1'],
           X_train['Categorical_Features_2'],
           X_train['Categorical_Features_3'],
           X_train['Categorical_Features_4'],
           X_train['Categorical_Features_5'],
           X_train['Categorical_Features_6']]
          ,
          y_train, 
          validation_split=0.2, 
          class_weight = d_class_weights,
          epochs=5, 
          batch_size=512)

Then I get this error:

InvalidArgumentError: indices[460,0] = 421 is not in [0, 406)
     [[{{node embedding_18/GatherV2}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@training_8/Adam/gradients/embedding_18/GatherV2_grad/Reshape"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_18/embeddings/read, embedding_18/Cast, embedding_17/GatherV2/axis)]]

This code was inspired by code from a Kaggle competition:

https://www.kaggle.com/aquatic/entity-embedding-neural-net

and by the paper "Entity Embeddings of Categorical Variables":

https://arxiv.org/abs/1604.06737

1 answer:

Answer 0: (score: 0)

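The problem is in an Embedding layer: its input dimension (len(word_index) + 1, or embedding_matrix.shape[0]) does not actually cover the largest index in the data. According to the error, the layer accepts indices in [0, 406), that is, it has 406 rows, but the input contains the integer index 421, which is out of range.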
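To make the range in the error message concrete, here is a minimal sketch of what an embedding lookup does, written in plain Python rather than Keras. The names lookup, NUM_ROWS and DIM are made up for illustration; only the 406 and 421 come from the error above. An embedding layer is essentially a table with one row per index, so any index at or beyond the row count has nothing to look up.

```python
def lookup(table, indices):
    """Return the table row for each index, roughly what an embedding layer does."""
    return [table[i] for i in indices]


# 406 rows mirrors the valid range [0, 406) reported in the error.
# DIM is an arbitrary embedding width chosen for this illustration.
NUM_ROWS, DIM = 406, 4
table = [[0.0] * DIM for _ in range(NUM_ROWS)]

# The largest valid index is NUM_ROWS - 1 = 405.
ok_rows = lookup(table, [0, 405])

# Index 421 has no row, so the lookup fails, analogous to the
# InvalidArgumentError raised by the real layer.
try:
    lookup(table, [421])
    failed = False
except IndexError:
    failed = True
```

The fix is either to give the layer enough rows (an input dimension of at least the largest index plus one) or to re-encode the inputs so that they fall inside the existing range.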

Double-check the indices in X_train_seq and confirm that they all fall within the range given to the Embedding layer.
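One way to do this check is sketched below with NumPy. The helper check_embedding_indices and the toy array are hypothetical, not part of the question's code; the idea is simply to compare the smallest and largest index in an input against the input dimension of the layer that consumes it.

```python
import numpy as np


def check_embedding_indices(indices, input_dim, name="input"):
    """Return True if every index lies in [0, input_dim), else report the span."""
    arr = np.asarray(indices)
    lo, hi = int(arr.min()), int(arr.max())
    ok = lo >= 0 and hi < input_dim
    if not ok:
        print(f"{name}: indices span [{lo}, {hi}], "
              f"but the layer accepts [0, {input_dim})")
    return ok


# Toy data standing in for a padded batch of sequences (made up for this sketch).
toy_seq = np.array([[0, 5, 405],
                    [3, 421, 7]])

# 421 >= 406, so this reports the mismatch and returns False.
in_range = check_embedding_indices(toy_seq, 406, name="toy_seq")
```

With the question's variables, the call would presumably look like check_embedding_indices(X_train_seq, embedding_matrix.shape[0]). The same check may be worth running on each categorical column against X_train[cat].nunique() + 1, since that size only covers every value when the column is encoded as contiguous integers starting near zero; the traceback alone does not say which layer embedding_18 is.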