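I'm new to machine learning, so this question may sound silly. I'm following a tutorial on Text Classification, but I've run into an error that I don't know how to solve.
Here is my code (basically the code from the tutorial):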
import pandas as pd
filepath_dict = {'yelp': 'data/yelp_labelled.txt',
                 'amazon': 'data/amazon_cells_labelled.txt',
                 'imdb': 'data/imdb_labelled.txt'}
df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source
    df_list.append(df)
df = pd.concat(df_list)
print(df.iloc[0:4])
from sklearn.feature_extraction.text import CountVectorizer
df_yelp = df[df['source'] == 'yelp']
sentences = df_yelp['sentence'].values
y = df_yelp['label'].values
from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
from keras.models import Sequential
from keras import layers
input_dim = X_train.shape[1]
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
history = model.fit(X_train, y_train,
                    nb_epoch=100,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
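When execution reaches the last line, I get this error:
"TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]"
I assume I have to apply some kind of transformation to the data I'm using, or else load the data in a different way. I've searched Stack Overflow but, being new to all of this, I couldn't find anything helpful.
How can I make this work? Ideally I'd like not just a solution but also a brief explanation of why the error occurs and how the solution fixes it.
Thanks!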
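Answer 0: (score: 2)
The reason you're running into this is that your X_train and X_test are of type <class 'scipy.sparse.csr.csr_matrix'>, while your model expects a NumPy array.
Try converting them to dense form and it should work: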
X_train = X_train.todense()
X_test = X_test.todense()
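
To illustrate why the error appears, here is a minimal sketch. It assumes the failure comes from something calling len() on the sparse matrix, which SciPy deliberately refuses to answer. The small matrix below is made-up stand-in data, not the tutorial's data. The sketch also uses toarray() as an alternative to todense(): toarray() returns a plain numpy.ndarray, whereas todense() returns a numpy.matrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the CountVectorizer output: 2 rows, 3 features.
X_sparse = sparse.csr_matrix(np.array([[1, 0, 2], [0, 0, 3]]))

# len() is undefined for SciPy sparse matrices and raises TypeError.
try:
    len(X_sparse)
    len_error = None
except TypeError as exc:
    len_error = str(exc)

# shape[0] is the unambiguous way to ask for the number of rows.
n_rows = X_sparse.shape[0]

# toarray() gives a regular ndarray that supports len() and indexing as usual.
X_dense = X_sparse.toarray()

print(len_error)
print(type(X_dense).__name__, X_dense.shape, n_rows, len(X_dense))
```

Keep in mind that a dense copy stores every zero explicitly, so this is only comfortable when the vocabulary and the number of sentences are small, as with the yelp subset used here.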
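Answer 1: (score: 1)
I'm not sure why your script raises that error.
The following script runs fine for me, even with a sparse matrix; try it on your machine.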
sentences = ['i want to test this', 'let us try this',
             'would this work', 'how about this',
             'even this', 'this should not work']
y = [0, 0, 0, 0, 0, 1]
from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
from keras.models import Sequential
from keras import layers
input_dim = X_train.shape[1]
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train,
          epochs=2,
          verbose=True,
          validation_data=(X_test, y_test),
          batch_size=2)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_5 (Dense)              (None, 10)                110
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11
=================================================================
Total params: 121
Trainable params: 121
Non-trainable params: 0
_________________________________________________________________
Train on 4 samples, validate on 2 samples
Epoch 1/2
4/4 [==============================] - 1s 169ms/step - loss: 0.7570 - acc: 0.2500 - val_loss: 0.6358 - val_acc: 1.0000
Epoch 2/2
4/4 [==============================] - 0s 3ms/step - loss: 0.7509 - acc: 0.2500 - val_loss: 0.6328 - val_acc: 1.0000