I'm experimenting with combining CNN models with NLP, specifically multi-label text classification.
I've read several papers reporting good results when applying CNNs to multi-label classification, and I'm trying to test such a model in Python.
I've also read many articles on how to work with NLP and neural networks.
However, this code of mine does not work and keeps giving me errors (every time I fix one error, I get another one).
I even ended up hiring paid freelancers to help fix the code; I hired five of them, but not one managed to fix it!
You are my last hope. I hope someone can help me fix this code and get it working.
First, some context on my dataset: it's a 100-record sample, just to make sure the code runs at all. I know the accuracy won't be good; I'll tune and improve the model later.
For now I just want this code to work, though tips on improving accuracy are very welcome.
The errors I'm getting are:
InvalidArgumentError: indices[1] = [0,13] is out of order. Many sparse ops require sorted indices.
Use `tf.sparse.reorder` to create a correctly ordered copy.
和
ValueError: Input 0 of layer sequential_8 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 18644]
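From what I understand so far, the first error happens because `TfidfVectorizer` returns a scipy sparse matrix, which Keras cannot consume directly. Here is a minimal sketch of the conversion I tried (with made-up mini-texts standing in for my real data; I'm not sure this is the whole fix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-documents standing in for my real Text column
texts = ["headache and neck stiffness", "fever chills rigors"]

vec = TfidfVectorizer()
X_sparse = vec.fit_transform(texts)  # scipy.sparse CSR matrix
X_dense = X_sparse.toarray()         # dense NumPy array that Keras layers accept
print(X_dense.shape)                 # (2, 7): 2 documents, 7 vocabulary terms
```

With the dense array the sparse-indices error goes away for me, but the ndim mismatch remains, which makes me think the model architecture itself also needs changing.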
Here is my code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from keras.layers import *
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.models import *
# Load Dataset
df_text = pd.read_csv("J:\\__DataSets\\__Samples\\Test\\data100\\text100.csv")
df_results = pd.read_csv("J:\\__DataSets\\__Samples\\Test\\data100\\results100.csv")
df = pd.merge(df_text,df_results, on="ID")
#Prepare multi-label
Labels = []
for i in df['Code']:
    Labels.append(i.split(","))
df['Labels'] = Labels
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df['Labels'])
y = multilabel_binarizer.transform(df['Labels'])
X = df['Text'].values
#TF-IDF
xtrain, xval, ytrain, yval = train_test_split(X, y, test_size=0.2, random_state=9)
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=1000)
# create TF-IDF features
X_train_count = tfidf_vectorizer.fit_transform(xtrain)
X_test_count = tfidf_vectorizer.transform(xval)
#Prepare Model
input_dim = X_train_count.shape[1] # Number of features
output_dim=len(df['Labels'].explode().unique())
sequence_length = input_dim
vocabulary_size = X_train_count.shape[0]
embedding_dim = output_dim
filter_sizes = [3,4,5]
num_filters = 512
drop = 0.5
epochs = 100
batch_size = 30
#CNN Model
inputs = Input(shape=(sequence_length,), dtype='int32')
embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim, input_length=sequence_length)(inputs)
reshape = Reshape((sequence_length,embedding_dim,1))(embedding)
conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
maxpool_0 = MaxPool2D(pool_size=(sequence_length - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(sequence_length - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(sequence_length - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_2)
concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)
dropout = Dropout(drop)(flatten)
output = Dense(units=2, activation='softmax')(dropout)
# this creates a model that includes
model = Model(inputs=inputs, outputs=output)
#Compile
checkpoint = ModelCheckpoint('weights.{epoch:03d}-{val_acc:.4f}.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
print("Training Model...")
model.summary()
#Fit
model.fit(X_train_count, ytrain, batch_size=batch_size, epochs=epochs, verbose=1, callbacks=[checkpoint], validation_data=(X_test_count, yval)) # starts training
#Accuracy
loss, accuracy = model.evaluate(X_train_count, ytrain, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test_count, yval, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
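Separately, while double-checking my label preparation, I looked at the shape that `MultiLabelBinarizer` produces. If I understand correctly, the final `Dense` layer should have one unit per label column (with a sigmoid activation for multi-label), rather than my `units=2` with softmax. A small sketch with made-up codes:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists standing in for my df['Labels'] column
labels = [["A32", "D50"], ["A32", "E11", "I10"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

print(mlb.classes_)  # ['A32' 'D50' 'E11' 'I10']
print(y.shape)       # (2, 4) -> the output Dense layer would need 4 units here
```

So in my real code the output layer would need `len(multilabel_binarizer.classes_)` units, if my reading is right.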
A sample of my dataset:
text100.csv
ID Text
1 Allergies to Drugs Attending:[**First Name3 (LF) 1**] Chief Complaint: headache and neck stiffne
2 Complaint: fever, chills, rigors Major Surgical or Invasive Procedure: Arterial l
3 Complaint: Febrile, unresponsive--> GBS meningitis and bacteremia Major Surgi
4 Allergies to Drugs Attending:[**First Name3 (LF) 45**] Chief Complaint: PEA arrest . Major Sur
5 Admitted to an outside hospital with chest pain and ruled in for myocardial infarction. She was tr
6 Known Allergies to Drugs Attending:[**First Name3 (LF) 78**] Chief Complaint: Progressive lethargy
7 Complaint: hypernatremia, unresponsiveness Major Surgical or Invasive Procedure: PEG/tra
8 Chief Complaint: cough, SOB Major Surgical or Invasive Procedure: RIJ placed Hemod
Results100.csv
ID Code
1 A32,D50,G00,I50,I82,K51,M85,R09,R18,T82,Z51
2 418,475,905,921,A41,C50,D70,E86,F32,F41,J18,R11,R50,Z00,Z51,Z93,Z95
3 136,304,320,418,475,921,998,A40,B37,G00,G35,I10,J15,J38,J69,L27,L89,T81,T85
4 D64,D69,E87,I10,I44,N17
5 E11,I10,I21,I25,I47
6 905,C61,C91,E87,G91,I60,M47,M79,R50,S43
7 304,320,355,E11,E86,E87,F06,I10,I50,I63,I69,J15,J69,L89,L97,M81,N17,Z91
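To show how the two files fit together, here is a tiny mock-up of the merge and label split from my code (with made-up rows, not my real data):

```python
import pandas as pd

# Hypothetical mini-versions of text100.csv and Results100.csv
df_text = pd.DataFrame({"ID": [1, 2],
                        "Text": ["headache and neck stiffness",
                                 "fever, chills, rigors"]})
df_results = pd.DataFrame({"ID": [1, 2],
                           "Code": ["A32,D50", "E11,I10,I21"]})

# Same join on ID and comma-split of Code as in the code above
df = pd.merge(df_text, df_results, on="ID")
df["Labels"] = [code.split(",") for code in df["Code"]]
print(df["Labels"].tolist())  # [['A32', 'D50'], ['E11', 'I10', 'I21']]
```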
Answer 0 (score: 0)
I don't have much to add at the moment, but I've found one debugging strategy useful: fix the errors one at a time, starting from the first.
This works for me because later errors are often just manifestations of earlier ones, so what looks like 50 errors may really be only one or two at the start.