How can I apply NLP with a CNN model?

Asked: 2020-07-18 04:12:34

Tags: python keras nlp cnn

I am working on combining a CNN machine-learning model with NLP (multi-label classification).

I have read some papers that reported good results when applying CNNs to multi-label classification,

so I am trying to test such a model in Python.

I have read many articles about how to work with NLP and neural networks.

My code below does not work and throws a lot of errors (every time I fix one error, I get another).

I ended up looking for paid freelancers to help me fix the code; I hired five of them, but none could fix it!

You are my last hope.

I hope someone can help me fix this code and get it working.

First, here is my dataset (a 100-record sample, just to make sure the code works; I know the accuracy will not be high, and I will tune and improve the model later):

http://shrinx.it/data100.zip

For now I just want this code to work, although tips on improving accuracy are very welcome.

These are some of the errors I am getting:

InvalidArgumentError: indices[1] = [0,13] is out of order. Many sparse ops require sorted indices.
    Use `tf.sparse.reorder` to create a correctly ordered copy.

ValueError: Input 0 of layer sequential_8 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 18644]
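From reading about the first error, it seems to be raised when the scipy sparse matrix that `TfidfVectorizer` returns is passed straight to Keras. If that is the cause, densifying the features first should sidestep it (a small self-contained sketch, not my full pipeline):

```python
import numpy as np
from scipy.sparse import csr_matrix

# TfidfVectorizer.fit_transform returns a scipy CSR matrix;
# Keras layers generally want a dense numpy array instead.
X_sparse = csr_matrix(np.array([[0.0, 0.5, 0.0],
                                [0.3, 0.0, 0.7]]))
X_dense = X_sparse.toarray()  # e.g. X_train_count.toarray() before model.fit

print(X_dense.shape)  # (2, 3)
```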

Here is my code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from keras.layers import *
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.models import *



# Load Dataset

df_text = pd.read_csv("J:\\__DataSets\\__Samples\\Test\\data100\\text100.csv")
df_results = pd.read_csv("J:\\__DataSets\\__Samples\\Test\\data100\\results100.csv")

df = pd.merge(df_text,df_results, on="ID")


#Prepare multi-label
Labels = [] 

for i in df['Code']: 
  Labels.append(i.split(",")) 


df['Labels'] = Labels



multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df['Labels'])

y = multilabel_binarizer.transform(df['Labels'])
X = df['Text'].values

#TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=1000)


xtrain, xval, ytrain, yval = train_test_split(X, y, test_size=0.2, random_state=9)


# create TF-IDF features
X_train_count = tfidf_vectorizer.fit_transform(xtrain)
X_test_count = tfidf_vectorizer.transform(xval)


#Prepare Model

input_dim = X_train_count.shape[1]  # Number of features
output_dim=len(df['Labels'].explode().unique())


sequence_length = input_dim
vocabulary_size = X_train_count.shape[0]
embedding_dim = output_dim
filter_sizes = [3,4,5]
num_filters = 512
drop = 0.5

epochs = 100
batch_size = 30



#CNN Model

inputs = Input(shape=(sequence_length,), dtype='int32')
embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim, input_length=sequence_length)(inputs)
reshape = Reshape((sequence_length,embedding_dim,1))(embedding)

conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)

maxpool_0 = MaxPool2D(pool_size=(sequence_length - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(sequence_length - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(sequence_length - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_2)

concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)
dropout = Dropout(drop)(flatten)
output = Dense(units=2, activation='softmax')(dropout)


# this creates a model that includes
model = Model(inputs=inputs, outputs=output)


#Compile

checkpoint = ModelCheckpoint('weights.{epoch:03d}-{val_acc:.4f}.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)


model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
print("Training Model...")
model.summary()


#Fit
model.fit(X_train_count, ytrain, batch_size=batch_size, epochs=epochs, verbose=1, callbacks=[checkpoint], validation_data=(X_test_count, yval))  # starts training



#Accuracy
loss, accuracy = model.evaluate(X_train_count, ytrain, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test_count, yval, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

A sample of my dataset:

text100.csv

ID  Text
1   Allergies to Drugs  Attending:[**First Name3 (LF) 1**] Chief Complaint: headache and neck stiffne
2   Complaint: fever, chills, rigors  Major Surgical or Invasive Procedure: Arterial l
3   Complaint: Febrile, unresponsive--> GBS meningitis and bacteremia  Major Surgi
4   Allergies to Drugs  Attending:[**First Name3 (LF) 45**] Chief Complaint: PEA arrest .   Major Sur
5   Admitted to an outside hospital with chest pain and ruled in for myocardial infarction.  She was tr
6   Known Allergies to Drugs  Attending:[**First Name3 (LF) 78**] Chief Complaint: Progressive lethargy 
7   Complaint: hypernatremia, unresponsiveness  Major Surgical or Invasive Procedure: PEG/tra
8   Chief Complaint: cough, SOB  Major Surgical or Invasive Procedure: RIJ placed Hemod

Results100.csv

ID  Code
1   A32,D50,G00,I50,I82,K51,M85,R09,R18,T82,Z51
2   418,475,905,921,A41,C50,D70,E86,F32,F41,J18,R11,R50,Z00,Z51,Z93,Z95
3   136,304,320,418,475,921,998,A40,B37,G00,G35,I10,J15,J38,J69,L27,L89,T81,T85
4   D64,D69,E87,I10,I44,N17
5   E11,I10,I21,I25,I47
6   905,C61,C91,E87,G91,I60,M47,M79,R50,S43
7   304,320,355,E11,E86,E87,F06,I10,I50,I63,I69,J15,J69,L89,L97,M81,N17,Z91

1 Answer:

Answer 0 (score: 0)

For now I don't have much to add, but I have found the following two debugging strategies useful:

  1. Break your errors down into distinct parts. For example, which errors relate to compiling the model and which to training it? There may also be errors raised before the model is even built. For the errors you showed, when are they first raised? It is hard to tell without line numbers and the like.

This step helps because later errors are sometimes just symptoms of earlier ones, so 50 errors at the start may really come down to one or two.

  2. With a good library, the error messages are usually helpful. Have you tried doing what the error message suggests?
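For example, the `ValueError` in the question already points at the likely mismatch: `TfidfVectorizer` produces a single 2-D matrix of shape (samples, features), while the `Embedding`/`Conv2D` stack expects integer token sequences, and a multi-label output layer usually needs one sigmoid unit per label rather than `Dense(units=2, activation='softmax')`. A minimal shape sanity check along those lines (a sketch with made-up sizes, not a drop-in fix for your code):

```python
import numpy as np

# Hypothetical sizes roughly matching the question's setup
n_samples, n_features, n_labels = 80, 1000, 120

X_tfidf = np.zeros((n_samples, n_features))  # TfidfVectorizer output is 2-D
y = np.zeros((n_samples, n_labels))          # MultiLabelBinarizer output

# An Embedding layer expects integer token indices, not TF-IDF floats,
# so a 2-D float matrix clashes with a Conv2D stack built for sequences.
print(X_tfidf.ndim)  # 2

# For multi-label output, the final Dense layer should have y.shape[1]
# units with sigmoid activation, not Dense(2, activation='softmax').
print(y.shape[1])    # 120
```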