Test accuracy is higher than training accuracy, 92% vs. 83%. Something feels off; can you find what it is?

Time: 2019-01-12 16:32:27

Tags: python-3.x lstm kaggle

I recently started the Titanic challenge on Kaggle. I decided to use an LSTM layer, and I based my code on this notebook: https://www.kaggle.com/lusob04/titanic-rnn/notebook . Using a dictionary, I convert every word to an integer, and these integers serve as input to the LSTM. I train the LSTM on 18 timesteps, where each step is one encoded word. However, the results I get on the test set are higher than on both validation and training, and the score itself seems suspiciously high: I get around 92%, while I see people struggling to get past 85%. Can you find what is wrong with my approach? Thanks.
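The encoding scheme described above (build one vocabulary over the whole corpus, map each word to an integer id, then left-pad every passenger record with zeros to a fixed length) can be sketched roughly as follows. The helper names `build_vocab` and `encode` and the toy rows are illustrative, not from the original code:

```python
from collections import Counter

import numpy as np


def build_vocab(rows):
    """Map each whitespace-separated token to an integer id.

    Ids are 1-based (0 is reserved for padding), with the most
    frequent tokens getting the smallest ids.
    """
    counts = Counter(word for row in rows for word in row.split())
    vocab = sorted(counts, key=counts.get, reverse=True)
    return {word: i for i, word in enumerate(vocab, 1)}


def encode(rows, vocab_to_int, seq_len):
    """Encode each row and left-pad it with zeros to seq_len."""
    out = np.zeros((len(rows), seq_len), dtype=int)
    for i, row in enumerate(rows):
        ids = [vocab_to_int[w] for w in row.split()][:seq_len]
        out[i, -len(ids):] = ids
    return out


# Two made-up passenger records: Pclass, Sex, Age, SibSp, Parch.
rows = ["3 male 22.0 1 0", "1 female 38.0 1 0"]
vocab_to_int = build_vocab(rows)
encoded = encode(rows, vocab_to_int, 8)
print(encoded.shape)
```

Note that because the vocabulary is built over the concatenation of train and test data, any token that appears only in the test set still gets an id; building it from the training set alone would raise a `KeyError` when encoding the test rows.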

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib
matplotlib.use('TkAgg')  # select the backend before importing pyplot
import matplotlib.pyplot as plt
import seaborn as sns
from keras.layers import Dense, LSTM, Activation, Dropout
from keras.models import Sequential
from keras.optimizers import Adam


## READING DATA ####
feature_sets_train_csv = pd.read_csv('all/train.csv')

feature_sets_test = pd.read_csv('all/test.csv')

feature_sets_conc = pd.concat([feature_sets_train_csv, feature_sets_test],sort=False)

passengers = [' '.join(map(str,passenger[[2,3,4,5,6,7,8,9,10]]))
              for passenger in feature_sets_train_csv.values]

passengers_test = [' '.join(map(str,passenger[[1,2,3,4,5,6,7,8,9]]))
              for passenger in feature_sets_test.values]

survived = [' '.join(map(str,passenger[[1]]))
              for passenger in feature_sets_train_csv.values]

labels_test = pd.read_csv("all/gender_submission.csv")
labels_test = labels_test.Survived

feature_sets_train = passengers
feature_sets_test = passengers_test
labels = survived

sns.barplot(x=feature_sets_train_csv.Sex,y=feature_sets_train_csv.Survived, hue=feature_sets_train_csv.Pclass)
#plt.show()

## Processing Data ###

passengers_all = [' '.join(map(str,passenger[[0,1,2,3,4,5,6,7,8,9,10,11]])) for passenger in feature_sets_conc.values]

all_text = ' '.join(passengers_all)
words = all_text.split()
counts = Counter(words)
vocab = sorted(counts, key = counts.get, reverse=True)

vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

feature_sets_ints = []
feature_sets_ints_test = []

for each in feature_sets_train:
    feature_sets_ints.append([vocab_to_int[word]
                              for word in each.split()])

for each in feature_sets_test:
    feature_sets_ints_test.append([vocab_to_int[word]
                              for word in each.split()])

features = np.zeros((len(feature_sets_ints), 24), dtype=int)

for i, row in enumerate(feature_sets_ints):
    features[i, -len(row):] = np.array(row)[:24]

features = features[:,6:24]

features_test = np.zeros((len(feature_sets_ints_test), 24), dtype=int)

for i, row in enumerate(feature_sets_ints_test):
    features_test[i, -len(row):] = np.array(row)[:24]

features_test = features_test[:,6:24]

labels = np.array(labels)
labels_test = np.array(labels_test)

train = features[0:650,:]
test = features_test[0:410]
val = features[651:851]

labels_train = labels[0:650]
labels_test = labels_test[0:410]
labels_val = labels[651:851]
#### MODEL #####
batch_size = 10
epoch = 15
train = train.reshape(train.shape[0],-1,1)
test = test.reshape(test.shape[0],-1,1)
val = val.reshape(val.shape[0],-1,1)

model = Sequential()
model.add(LSTM(256, input_shape=train.shape[1:],batch_size=batch_size))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.compile(optimizer=Adam(), loss='mean_squared_error', metrics=['accuracy'])
model.fit(train,labels_train, batch_size=batch_size, epochs=epoch,validation_data=(val,labels_val), verbose = 1)

scores = model.evaluate(test, labels_test, batch_size=batch_size)
predictions = model.predict(test, batch_size = batch_size)

print('LSTM test score:', scores[0])
print('LSTM test accuracy:', scores[1])
rounded = [round(x[0]) for x in predictions]

0 Answers:

There are no answers yet.