My model won't learn: text classification with an LSTM

Asked: 2020-11-01 16:42:56

Tags: python tensorflow keras deep-learning nlp

My data can be found here:

https://www.dropbox.com/sh/53ii1gpm155f1x8/AADoYZk3cQt5Zw7tfuSV6kZBa?dl=0

The DataFrame contains texts from 3 authors, and each text belongs to one of them.

The task I'm working on: classification with an LSTM layer (set up like a sentiment-analysis model), meaning that at the end I test the model on unseen data and it should tell me which author the text belongs to.

I have already spent many hours on this problem, but in the end my model's accuracy does not change at all as the number of epochs increases.

Since I'm new to deep learning, I don't know where it's going wrong. Is there anything else I can do to improve the accuracy?

Can someone help me?

Here is the code:

import pandas as pd
import numpy as np
import xgboost as xgb
from tqdm import tqdm
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')  # NLTK's list name is lowercase
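
As an aside, stop_words is built here but never applied anywhere later in the snippet (presumably data.csv was cleaned beforehand). A minimal sketch of how it could be applied, using the word_tokenize already imported above; the remove_stopwords helper is my own, not part of the original code:

def remove_stopwords(text):
    # drop tokens that appear in NLTK's English stopword list
    return ' '.join(w for w in word_tokenize(text)
                    if w.lower() not in stop_words)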


# load the pre-cleaned data and drop the leftover index column
traincleaned = pd.read_csv('../input/dataclean/data.csv')
traincleaned.drop('Unnamed: 0', axis=1, inplace=True)
traincleaned[:3]

# list of cleaned texts as plain strings
data_list = traincleaned.clean_text.apply(str).tolist()


X = traincleaned.clean_text
Y = traincleaned.author

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1000)
print(X_train.shape)
print(y_train.shape)


# transform the string labels into integers

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
integer_encoded_train = label_encoder.fit_transform(y_train)  # fit on the training labels only
integer_encoded_test = label_encoder.transform(y_test)        # reuse the same mapping, no refit

le_name_mapping = dict(zip(label_encoder.transform(label_encoder.classes_),
                           label_encoder.classes_))
print("Label Encoding Classes as ")
print(le_name_mapping)

# convert the integer class vectors to binary (one-hot) class matrices
y_train = np_utils.to_categorical(integer_encoded_train, num_classes=3)
y_val = np_utils.to_categorical(integer_encoded_test, num_classes=3)
print("One Hot Encoded class shape ")
print(y_train.shape)
print(y_train[0])
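
For intuition, a tiny standalone illustration (made-up labels, not the real data) of what to_categorical does with a 3-class integer vector:

import numpy as np
from keras.utils import np_utils

toy_labels = np.array([0, 2, 1])
print(np_utils.to_categorical(toy_labels, num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]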



# raw text arrays for the tokenizer
X_train0 = X_train.values.astype(str)
X_val0 = X_test.values.astype(str)
print('X_train0 shape: {}'.format(X_train0.shape))
print('X_val0 shape: {}'.format(X_val0.shape))





from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# fit the tokenizer on the training texts only, keeping the 20,000 most frequent words
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(X_train0)
X_train = tokenizer.texts_to_sequences(X_train0)
X_test = tokenizer.texts_to_sequences(X_val0)
vocab_size = len(tokenizer.word_index) + 1  # add 1 for the reserved 0 (padding) index

# pad/truncate every sequence to the same length
maxlen = 300
padded_sequence_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
padded_sequence_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
print(vocab_size)
print(X_train[2])
print(X_train0[2])
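
As a side note, here is a toy example (invented sentences, not the real corpus) of what texts_to_sequences and pad_sequences produce, using the Tokenizer and pad_sequences imported above. Words never seen during fit_on_texts are silently dropped unless an oov_token is set:

toy = ["the cat sat", "the dog sat on the mat"]
tok = Tokenizer(num_words=100, oov_token='<OOV>')  # oov_token is optional
tok.fit_on_texts(toy)
seqs = tok.texts_to_sequences(["the bird sat"])    # "bird" was never seen
print(seqs)                                        # -> [[2, 1, 3]]; 1 is the OOV index
print(pad_sequences(seqs, padding='post', maxlen=6))
# -> [[2 1 3 0 0 0]]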



# for reference: the longest tokenized sentence in the training set
# (note: .str.len() on the raw text would count characters, not tokens)
max_token_len = max(len(seq) for seq in X_train)
max_token_len



from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense, Dropout
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.layers import Embedding

embedding_vector_length = 50

model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_length, input_length=300))
model.add(SpatialDropout1D(0.25))
model.add(LSTM(30, dropout=0.5, recurrent_dropout=0.5))
model.add(Dropout(0.2))
model.add(Dense(3, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
print(model.summary())


history = model.fit(padded_sequence_train, y_train,
                    validation_data=(padded_sequence_test, y_val),
                    epochs=20, batch_size=70)
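
Incidentally, EarlyStopping is imported at the top but never used. If desired, it could be wired into fit like this (the patience value and restore_best_weights flag are my own choices, not from the original code):

early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)
history = model.fit(padded_sequence_train, y_train,
                    validation_data=(padded_sequence_test, y_val),
                    epochs=20, batch_size=70,
                    callbacks=[early_stop])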


The model summary is as follows:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_17 (Embedding)     (None, 300, 50)           881900    
_________________________________________________________________
spatial_dropout1d_16 (Spatia (None, 300, 50)           0         
_________________________________________________________________
lstm_17 (LSTM)               (None, 30)                9720      
_________________________________________________________________
dropout_16 (Dropout)         (None, 30)                0         
_________________________________________________________________
dense_17 (Dense)             (None, 3)                 93        
=================================================================
Total params: 891,713
Trainable params: 891,713
Non-trainable params: 0
_________________________________________________________________
None

Also, when I fit the model, the output looks like this:


Train on 15663 samples, validate on 3916 samples
Epoch 1/30
15663/15663 [==============================] - 92s 6ms/step - loss: 0.6338 - acc: 0.6667 - val_loss: 0.6318 - val_acc: 0.6667
Epoch 2/30
15663/15663 [==============================] - 93s 6ms/step - loss: 0.6331 - acc: 0.6666 - val_loss: 0.6317 - val_acc: 0.6667
Epoch 3/30
15663/15663 [==============================] - 92s 6ms/step - loss: 0.6326 - acc: 0.6667 - val_loss: 0.6318 - val_acc: 0.6667
Epoch 4/30
15663/15663 [==============================] - 92s 6ms/step - loss: 0.6327 - acc: 0.6667 - val_loss: 0.6315 - val_acc: 0.6667
Epoch 5/30
15663/15663 [==============================] - 91s 6ms/step - loss: 0.6321 - acc: 0.6667 - val_loss: 0.6316 - val_acc: 0.6667
Epoch 6/30
15663/15663 [==============================] - 94s 6ms/step - loss: 0.6321 - acc: 0.6667 - val_loss: 0.6316 - val_acc: 0.6667
Epoch 7/30
15663/15663 [==============================] - 91s 6ms/step - loss: 0.6320 - acc: 0.6667 - val_loss: 0.6316 - val_acc: 0.6667
Epoch 8/30
15663/15663 [==============================] - 92s 6ms/step - loss: 0.6321 - acc: 0.6667 - val_loss: 0.6318 - val_acc: 0.6667
Epoch 9/30
15610/15663 [============================>.] - ETA: 0s - loss: 0.6318 - acc: 0.6667
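
For reference, a conventional head for a single-label 3-class problem like this one is a softmax output with categorical_crossentropy. With binary_crossentropy on one-hot 3-class targets, Keras reports binary accuracy, and a model that pushes all three outputs toward 0 already gets 2 of every 3 binary labels right, i.e. 2/3 ≈ 0.6667, which is consistent with the plateau in the log above. A sketch of that variant of the model, with all other layers kept the same (an illustration, not a verified fix):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dropout, Dense

model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_length, input_length=300))
model.add(SpatialDropout1D(0.25))
model.add(LSTM(30, dropout=0.5, recurrent_dropout=0.5))
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax'))       # probability distribution over the 3 authors
model.compile(loss='categorical_crossentropy',  # matches the one-hot labels built earlier
              optimizer='adam', metrics=['accuracy'])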

0 Answers

No answers yet.