我想使用CuDNNLSTM来预测Y标签。我有一个数据集,我想使用CuDNNLSTM来预测代码。我正在考虑将句子作为X标签,将代码视为Y标签。 该模型实际上给出了每个类别的概率矩阵。我想知道 1.如何预测实际的句子代码 2。 数据集有点像这样:
Google headquarters is in California 98873
Google pixel is a very nice phone 98873
Steve Jobs was a great man 15890
Steve Jobs has done great technology innovations 15890
Microsoft is another great giant in technology 89736
Bill Gates founded Microsoft 89736
我从此链接获得帮助: https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17
我正在使用的以下代码预测概率矩阵,我想知道它如何预测实际的句子代码。 另外,我们可以使用tfidf矢量化器吗?
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 50000
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100
tokenizer = keras.preprocessing.text.Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['procedureNew'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df['procedureNew'].values)
X = keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df['SuggestedCpt1']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10, random_state = 42)
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(CuDNNLSTM(100))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 10
batch_size = 40
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
accr = model.evaluate(X_test,Y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0],accr[1]*100, "%"))
new_sentence = ['Pixel phone is launched by Google']
seq = tokenizer.texts_to_sequences(new_procedure)
padded = keras.preprocessing.sequence.pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)
labels = ['98873', '15890', '89736', '87325', '23689', '10368', '45789', '36975', '26987', '64721']
print(pred, labels[np.argmax(pred)])
print("\npredicted sentence code is", labels[np.argmax(pred)])