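I am training a machine learning model that assigns labels to paragraphs describing activities. In my dataset, each description (X) has several corresponding labels (Y), so this is a multi-label problem. I would like to improve the classification accuracy.
I have built several models with scikit-learn (e.g. SVC, DecisionTreeClassifier, KNeighborsClassifier, RadiusNeighborsClassifier, ExtraTreesClassifier, RandomForestClassifier, MLPClassifier, RidgeClassifierCV) and neural network models with Keras. The best accuracy I can get (under a strict metric) is 47%, using OneVsRestClassifier(SGDClassifier).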
print(X)
0 Contribution to METU HS Ankara Lab Protocols ...
1 Attend the MakerFaire in Hannover to demonstr...
2 Organize a "Biotech Day" and present the proj...
3 Contact and connect with Community Labs in Eu...
4 Invite "Technik Garage," a German Community L...
5 Present the project to the biotechnology comp...
6 Visit one of Europe's largest detergent plant...
...
print(y2)
0 [Community Event]
1 [Project Presentation, Community Event]
2 [Project Presentation, Teaching Activity]
3 [Conference/Panel Discussion, Consult Experts]
4 [Conference/Panel Discussion, Consult Experts]
5 [Conference/Panel Discussion, Project Presenta...
6 [Consult Experts]
...
...
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb_y2 = mlb.fit_transform(y2)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, mlb_y2, test_size=0.2, random_state=52)
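To illustrate what MultiLabelBinarizer produces, here is a minimal sketch using the first four label lists printed above. The names sample_labels, sample_mlb and encoded exist only for this illustration; the columns follow the alphabetical order of the label names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# First four rows of y2 as printed above (illustration only).
sample_labels = [
    ["Community Event"],
    ["Project Presentation", "Community Event"],
    ["Project Presentation", "Teaching Activity"],
    ["Conference/Panel Discussion", "Consult Experts"],
]

sample_mlb = MultiLabelBinarizer()
encoded = sample_mlb.fit_transform(sample_labels)

# One column per distinct label, sorted by name.
print(list(sample_mlb.classes_))
# One row per description, with 1 where the label applies.
print(encoded)
```

Each row of the resulting matrix is the target vector that the classifiers below try to predict in full.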
Scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
pipe = Pipeline(steps=[
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', OneVsRestClassifier(SGDClassifier(
        loss='hinge', alpha=0.00026, penalty='elasticnet',
        max_iter=2000, tol=0.0008,
        learning_rate='adaptive', eta0=0.12))),
])
pipe.fit(X_train, y_train)
print("test model score: %.3f" % pipe.score(X_test, y_test))
print("train model score: %.3f" % pipe.score(X_train, y_train))
test model score: 0.478
train model score: 0.801 (The model is overfitting! I adjusted the penalty and alpha terms, but that did not improve things much. I don't know whether there is another way to regularize it.)
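For reference, a cross-validated search over the regularization settings could be sketched roughly as follows. This is only a sketch under assumptions: the toy texts and labels are made-up placeholders rather than my data, the grid values are arbitrary, and TfidfVectorizer is used as a shorthand for CountVectorizer followed by TfidfTransformer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data standing in for X and y2.
toy_texts = [
    "present the project at a fair",
    "organize a biotech day for students",
    "consult experts at a community lab",
    "visit a detergent plant and ask experts",
    "present the project to a company",
    "teach students about the project",
    "invite a community lab to a panel",
    "attend a fair and present the project",
]
toy_labels = [
    ["Project Presentation", "Community Event"],
    ["Teaching Activity", "Community Event"],
    ["Consult Experts"],
    ["Consult Experts"],
    ["Project Presentation"],
    ["Teaching Activity", "Project Presentation"],
    ["Consult Experts", "Community Event"],
    ["Project Presentation", "Community Event"],
]
toy_y = MultiLabelBinarizer().fit_transform(toy_labels)

search_pipe = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),
    ("classifier", OneVsRestClassifier(SGDClassifier(
        loss="hinge", penalty="elasticnet",
        max_iter=2000, tol=1e-3, random_state=0))),
])

# Parameters of the wrapped SGDClassifier are reached via "estimator__".
param_grid = {
    "classifier__estimator__alpha": [1e-4, 1e-3, 1e-2],
    "classifier__estimator__l1_ratio": [0.0, 0.15, 0.5],
}

# The default scoring is the pipeline's own score (exact-match accuracy).
search = GridSearchCV(search_pipe, param_grid, cv=2)
search.fit(toy_texts, toy_y)
print(search.best_params_, search.best_score_)
```

With so little toy data, some labels can be missing from a fold and scikit-learn may emit warnings. On the real data the same grid would be run on X_train and y_train.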
Keras:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=300, lower=True)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
vocab_size = len(tokenizer.word_index) + 1
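x = pad_sequences(sequences, padding='post', maxlen=80)
# Re-split using the padded sequences so the network is trained on integer arrays, not raw text
X_train, X_test, y_train, y_test = train_test_split(x, mlb_y2, test_size=0.2, random_state=52)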
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, Flatten, GlobalMaxPool1D, Dropout, Conv1D, LSTM, SpatialDropout1D
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from keras.losses import binary_crossentropy
from keras.optimizers import Adam
import sklearn
filter_length = 1000
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim= 70, input_length=80))
model.add(Dropout(0.1))
model.add(Conv1D(filter_length, 3, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPool1D())
#model.add(SpatialDropout1D(0.1))
#model.add(LSTM(100, dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(len(mlb.classes_)))
model.add(Activation('sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['categorical_accuracy'])
callbacks = [ReduceLROnPlateau(), EarlyStopping(patience=4),
             ModelCheckpoint(filepath='model-conv1d.h5', save_best_only=True)]
history = model.fit(X_train, y_train, epochs=80, batch_size=500,
                    validation_split=0.1, verbose=2, callbacks=callbacks)
from keras import models
cnn_model = models.load_model('model-conv1d.h5')
from sklearn.metrics import accuracy_score
y_pred = cnn_model.predict(X_test)
accuracy_score(y_test,y_pred.round())
Out: 0.4405555555555556 (I think the neural network model has more room for improvement. But I'm not sure how to achieve that.)
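I would like the accuracy to reach at least 60%. Could you give me some suggestions for improving my scikit-learn and Keras model code?
More specifically: 1. Can OneVsRestClassifier(SGDClassifier) be improved? 2. Is there a way to improve my convolutional neural network, or should I use some form of recurrent neural network instead? (I tried a simple RNN, but it did not work well.)
PS: With the way I compute accuracy, if for a description (X) the model outputs [0,0,0,1,0,1] (y_pred) and the correct output is [0,0,0,1,0,0] (y_test), is my accuracy for that sample 0 rather than 5/6?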
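The two ways of scoring can be compared directly on the example from the PS. A minimal sketch, relying on scikit-learn's accuracy_score treating 2D indicator arrays as multi-label input, in which case it reports exact-match (subset) accuracy; 1 - hamming_loss is shown as the per-label alternative, and the variable names are only for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# The single example from the PS: one extra label is predicted.
y_true_example = np.array([[0, 0, 0, 1, 0, 0]])
y_pred_example = np.array([[0, 0, 0, 1, 0, 1]])

# Exact-match accuracy: a sample counts only if every label is right.
subset_acc = accuracy_score(y_true_example, y_pred_example)

# Per-label accuracy: fraction of individual label decisions that are right.
label_acc = 1 - hamming_loss(y_true_example, y_pred_example)

# Micro-averaged F1 over all label decisions, another common multi-label metric.
micro_f1 = f1_score(y_true_example, y_pred_example, average="micro")

print(subset_acc)  # 0.0
print(label_acc)   # 5/6, about 0.833
print(micro_f1)    # 2/3, about 0.667
```

So the 47% and 44% figures reported here are exact-match scores; the same predictions would look considerably better under a per-label metric.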
This is a long question. Thank you very much!