使用word2vec

时间:2018-07-30 01:27:12

标签: keras word2vec text-classification word-embedding

我想使用我自己的词数据集来创建嵌入。并使用我自己的标签数据来训练和测试模型。为此,我已经使用word2vec创建了自己的单词嵌入。并在使用标签数据训练模型时面临问题。

尝试训练模型时出现错误。我的模型创建代码:

# create the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
encoded_docs = tokenizer.texts_to_sequences(X_train)

max_length = max([len(s.split()) for s in X_train])
X_train = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_test)
encoded_docs = tokenizer.texts_to_sequences(X_test)

X_test = pad_sequences(encoded_docs, maxlen=max_length, padding='post')


# setup the embedding layer
embeddings = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1],
                  weights=[embedding_matrix],input_length= max_length, trainable=False)

new_model = Sequential() new_model.add(embeddings)
new_model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
new_model.add(MaxPooling1D(pool_size=2)) new_model.add(Flatten())
new_model.add(Dense(1, activation='sigmoid'))

这就是我创建嵌入矩阵的方式-

embedding_matrix = np.zeros((len(model.wv.vocab), vector_dim))
    for i in range(len(model.wv.vocab)):
        embedding_vector = model.wv[model.wv.index2word[i]]
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

这样做,我得到以下错误-

 WARNING:tensorflow:From /Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1290: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Epoch 1/10
Traceback (most recent call last):
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[27,2] = 1049 is not in [0, 1045)
     [[Node: embedding_1/GatherV2 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/GatherV2/axis)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/faysal/Desktop/My Computer/D/Code Workspace/Research-IoT/embedding-tut/src/main.py", line 359, in <module>
    custom_keras_model(embedding_matrix, model.wv)
  File "/Users/faysal/Desktop/My Computer/D/Code Workspace/Research-IoT/Collaboration/embedding-tut/src/main.py", line 295, in custom_keras_model
    new_model.fit(X_train, y_train, epochs=10, verbose=2)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/models.py", line 867, in fit
    initial_epoch=initial_epoch)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/engine/training.py", line 1598, in fit
    validation_steps=validation_steps)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/engine/training.py", line 1183, in _fit_loop
    outs = f(ins_batch)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2273, in __call__
    **self.session_kwargs)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[27,2] = 1049 is not in [0, 1045)
     [[Node: embedding_1/GatherV2 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/GatherV2/axis)]]

Caused by op 'embedding_1/GatherV2', defined at:
  File "/Users/faysal/Desktop/My Computer/D/Code Workspace/Research-IoT/Collaboration/embedding-tut/src/main.py", line 359, in <module>
    custom_keras_model(embedding_matrix, model.wv)
  File "/Users/faysal/Desktop/My Computer/D/Code Workspace/Research-IoT/Collaboration/embedding-tut/src/main.py", line 278, in custom_keras_model
    new_model.add(embeddings)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/models.py", line 442, in add
    layer(x)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/engine/topology.py", line 602, in __call__
    output = self.call(inputs, **kwargs)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/layers/embeddings.py", line 134, in call
    out = K.gather(self.embeddings, inputs)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1134, in gather
    return tf.gather(reference, indices)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 2736, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3065, in gather_v2
    "GatherV2", params=params, indices=indices, axis=axis, name=name)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): indices[27,2] = 1049 is not in [0, 1045)
     [[Node: embedding_1/GatherV2 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/GatherV2/axis)]]


Process finished with exit code 1

在将训练数据拟合到模型中时出现错误。我认为我在计算训练数据形状并将其注入模型中是错误的。

1 个答案:

答案 0 :(得分:1)

您正在使用两个不同的Tokenizer,并分别在训练和测试中对其进行训练。发生的情况是,您的令牌与培训和测试不匹配。造成您的错误是因为发生令牌(1049),该令牌不在max_length中。即使您修复了该问题,但是如果您有两个标记程序,您的模型也将无法工作。

您应该怎么做才能使Tokenizer适合所有数据(X_train和X_test),并且仅使用一个Tokenizer。