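I have a dataset with 4000 rows and two columns. The first column contains sentences and the second column contains numbers. There are about 4000 sentences, and they are classified by about 100 different numbers. For example: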
Sentences Codes
Google headquarters is in California 87390
Steve Jobs was a great man 70214
Steve Jobs has done great technology innovations 70214
Google pixel is a very nice phone 87390
Microsoft is another great giant in technology 67012
Bill Gates founded Microsoft 67012
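Similarly, there are 4000 rows in total containing such sentences, and these rows are classified into 100 such codes.
I tried the code below, but when I predict, it predicts the same value for every input. In other words, y_pred is an array of identical values.
Can you tell me where the code goes wrong?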
import pandas as pd
import numpy as np
xl = pd.ExcelFile("dataSet.xlsx")
df = xl.parse('Sheet1')
#df = df.sample(frac=1).reset_index(drop=True)# shuffling the dataframe
df = df.sample(frac=1).reset_index(drop=True)# shuffling the dataframe
X = df.iloc[:, 0].values
Y = df.iloc[:, 1].values
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pickle
count_vect = CountVectorizer()
X = count_vect.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X = tfidf_transformer.fit_transform(X)
X = X.toarray()
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
y = Y.reshape(-1, 1) # Because Y has only one column
onehotencoder = OneHotEncoder(categories='auto')
Y = onehotencoder.fit_transform(y).toarray()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
inputDataLength = len(X_test[0])
outputDataLength = len(Y[0])
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.layers import Dropout
# fitting the model
embedding_vector_length = 100
model = Sequential()
model.add(Embedding(outputDataLength,embedding_vector_length, input_length=inputDataLength))
model.add(Dropout(0.2))
model.add(LSTM(outputDataLength))
model.add(Dense(outputDataLength, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=20)
y_pred = model.predict(X_test)
invorg = model.inverse_transform(y_test)
y_test = labelencoder_Y.inverse_transform(invorg)
inv = onehotencoder.inverse_transform(y_pred)
y_pred = labelencoder_Y.inverse_transform(inv)
Answer 0 (score: 2)
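You are using binary_crossentropy even though you have 100 classes. That is not correct: for a multi-class task like this you have to use categorical_crossentropy.
Compile the model like this: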
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
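To see how differently the two losses score the same prediction, here is a minimal pure-Python sketch. The helper functions are hand-written illustrations of the textbook formulas (not the Keras implementations), and the one-hot target and uniform prediction are made-up values for a hypothetical 100-class problem.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    # Textbook formula: -sum_i t_i * log(p_i) over the classes
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(y_true, y_pred):
    # Treats every output unit as an independent yes/no problem,
    # then averages the per-unit losses
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

n_classes = 100                        # assumption: matches the 100 codes in the question
y_true = [0.0] * n_classes
y_true[3] = 1.0                        # hypothetical one-hot target, true class index 3
uniform = [1.0 / n_classes] * n_classes  # a prediction that carries no information

cce = categorical_crossentropy(y_true, uniform)
bce = binary_crossentropy(y_true, uniform)
print(f"categorical: {cce:.3f}  binary: {bce:.3f}")
```

In this sketch the uninformative prediction gets a categorical loss of log(100), roughly 4.6, but a binary loss of only about 0.056, because 99 of the 100 outputs are counted as nearly correct "zeros". That is one way to see why binary_crossentropy is a poor fit for one-hot targets with many classes.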
Also, you are predicting with the model and converting the output to class labels like this:
y_pred = model.predict(X_test)
inv = onehotencoder.inverse_transform(y_pred)
y_pred = labelencoder_Y.inverse_transform(inv)
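Since the model's output layer uses a softmax activation, you have to take the argmax of each prediction to get the class label.
For example, if the prediction is [0.2, 0.3, 0.0005, 0.99], taking the argmax gives 3, the index of the class with the highest probability.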
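The argmax step can be tried on its own with NumPy. In the sketch below, the classes array is a hypothetical stand-in for labelencoder_Y.classes_ (the sorted original codes), and the probability rows are made-up softmax outputs rather than real model predictions.

```python
import numpy as np

# Hypothetical sorted class codes, standing in for labelencoder_Y.classes_
classes = np.array([67012, 70214, 87390])

# Made-up softmax outputs for three test sentences (each row sums to 1)
probs = np.array([
    [0.10, 0.20, 0.70],
    [0.80, 0.15, 0.05],
    [0.25, 0.50, 0.25],
])

pred_idx = np.argmax(probs, axis=1)  # index of the most probable class in each row
pred_codes = classes[pred_idx]       # map indices back to the original codes
print(pred_idx.tolist())
print(pred_codes.tolist())
```

Indexing into the sorted class array is, as far as I understand it, the same mapping that LabelEncoder.inverse_transform applies to integer labels.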
So you have to modify your prediction code like this:
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
y_pred = labelencoder_Y.inverse_transform(y_pred)
invorg = np.argmax(y_test, axis=1)
invorg = labelencoder_Y.inverse_transform(invorg)
Now invorg holds the actual class labels and y_pred holds the predicted class labels.
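Once both arrays are in the original label space, a quick sanity check is to compare them and to count how many distinct codes the model predicts, since the original symptom was a single repeated value. The arrays below are made-up stand-ins for invorg and y_pred, not real results.

```python
import numpy as np

# Hypothetical decoded labels, standing in for invorg and y_pred
invorg = np.array([87390, 70214, 70214, 87390, 67012, 67012])
y_pred = np.array([87390, 70214, 67012, 87390, 67012, 70214])

accuracy = float(np.mean(invorg == y_pred))  # fraction of exact label matches
distinct = np.unique(y_pred)                 # sorted unique predicted codes

print(f"accuracy: {accuracy:.2f}")
print(f"distinct predicted codes: {len(distinct)}")
```

If the number of distinct predicted codes is still 1 on the real data, the model is still collapsing to a single class and the problem lies somewhere other than the decoding step.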