I am working on a text classification problem with about 1 million reviews, from which I have to predict sentiment. Here is a glimpse of my dataset:
In [3]:df=pd.read_csv('amazon_review.csv')
df.head()
Out [3]:
Review_no reviewText Sentiment
1 I enjoy vintage books and movies so I enjoyed ... Happy
2 This book is a reissue of an old one; the auth... Happy
3 This was a fairly interesting read. It had ol... Happy
4 I'd never read any of the Amy Brewster mysteri... Happy
5 If you like period pieces - clothing, lingo, y... Happy
This is the code I am using:
In [10]: X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.33)
In[11]: y_train = y_train.map({"Happy": 1, "Content" : 2, "Unhappy" : 3 })
y_test = y_test.map({"Happy": 1, "Content" : 2, "Unhappy" : 3 })
In [12]: y_train = to_categorical(y_train)
y_test=to_categorical(y_test)
In [13]: all_text=X_train.append(X_test)
In [14]: from sklearn.feature_extraction.text import TfidfVectorizer
word_vectorizer = TfidfVectorizer(
sublinear_tf=True,
strip_accents='unicode',
analyzer='word',
token_pattern=r'\w{1,}',
stop_words='english',
ngram_range=(1, 1),
max_features=20000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(X_train)
test_word_features = word_vectorizer.transform(X_test)
In [15]: train_word_features.shape
Out [15]: (658118,20000)
In [16]: model = Sequential()
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2,input_shape=
(658118,20000)))
model.add(Dense(4, activation='softmax'))
In [17]: model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
In [18]: model.fit(train_word_features, y_train,
batch_size=batch_size,
epochs=15,
validation_data=(test_word_features, y_test))
I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-90-46b758e790c5> in <module>()
2 batch_size=batch_size,
3 epochs=15,
----> 4 validation_data=(test_word_features, y_test))
~\Anaconda3\lib\site-packages\keras\models.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
958 initial_epoch=initial_epoch,
959 steps_per_epoch=steps_per_epoch,
--> 960 validation_steps=validation_steps)
961
962 def evaluate(self, x, y, batch_size=32, verbose=1,
~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1572 class_weight=class_weight,
1573 check_batch_axis=False,
-> 1574 batch_size=batch_size)
1575 # Prepare validation data.
1576 do_validation = False
~\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_batch_axis, batch_size)
1405 self._feed_input_shapes,
1406 check_batch_axis=False,
-> 1407 exception_prefix='input')
1408 y = _standardize_input_data(y, self._feed_output_names,
1409 output_shapes,
~\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
139 ' to have ' + str(len(shapes[i])) +
140 ' dimensions, but got array with shape ' +
--> 141 str(array.shape))
142 for j, (dim, ref_dim) in enumerate(zip(array.shape, shapes[i])):
143 if not j and not check_batch_axis:
ValueError: Error when checking input: expected lstm_14_input to have 3 dimensions, but got array with shape (658118, 20000)
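I can't work out where the problem is. How do I change the dimensions? Calling .toarray() on the sparse matrix causes a memory error because the dataset is so large.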
Answer 0 (score: 0)
According to the documentation, TfidfVectorizer gives you a sparse matrix:
Returns:
X: sparse matrix, [n_samples, n_features] Tf-idf-weighted document-term matrix.
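A minimal sketch of what that return value looks like, using three made-up reviews in place of the real dataset (the `reviews` list and the variable names are assumptions for illustration only):

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the review texts (made up for illustration).
reviews = [
    "I enjoyed this vintage book",
    "A fairly interesting read",
    "I did not like this book at all",
]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(reviews)

# The result is a SciPy sparse matrix with exactly two axes:
# one row per document, one column per vocabulary term.
print(sparse.issparse(features))
print(features.shape)
```

The matrix has only two dimensions, (n_samples, n_features), which is why a layer that expects three dimensions rejects it.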
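An LSTM operates on sequences, so your data must be in sequence form: Keras expects a 3-D input of shape (samples, timesteps, features), which is what the error message is complaining about. Your data is a 2-D sparse matrix. In any case, I think you should look into word embeddings, because feeding the output of TfidfVectorizer into an LSTM does not make sense. If you want to use TF-IDF, take a look at bag-of-words models.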
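To make "in the form of sequences" concrete, here is a rough pure-Python sketch that turns each review into a fixed-length list of integer token ids. The helpers `build_vocab` and `encode`, the whitespace tokenizer, and the toy reviews are all hypothetical and chosen for illustration; a real pipeline would more likely use a library tokenizer.

```python
def build_vocab(texts):
    """Map each distinct lowercase token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(texts, vocab, max_len):
    """Turn each text into a list of token ids, truncated and right-padded with 0 to max_len."""
    sequences = []
    for text in texts:
        ids = [vocab[t] for t in text.lower().split() if t in vocab][:max_len]
        sequences.append(ids + [0] * (max_len - len(ids)))
    return sequences


# Toy reviews, made up for illustration.
reviews = ["I enjoyed this book", "fairly interesting read", "this book was dull"]
vocab = build_vocab(reviews)
padded = encode(reviews, vocab, max_len=5)
print(padded)
```

Data in this form has shape (n_samples, max_len) and preserves word order. In Keras, an Embedding layer placed in front of the LSTM would typically map each id to a vector, which yields the 3-D input the LSTM expects.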
It may also help to read some blog posts on sentiment analysis; there are thousands of them.
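For the TF-IDF route, one possible bag-of-words setup is a linear classifier trained directly on the sparse matrix, so that no dense conversion is needed. This is only a sketch under assumptions: the choice of LogisticRegression, the toy texts, and the two-label subset are mine, not something taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration; label names are borrowed from the question.
texts = [
    "I loved this book, a wonderful read",
    "wonderful story, I enjoyed every page",
    "terrible plot, I hated it",
    "boring and terrible, a waste of time",
]
labels = ["Happy", "Happy", "Unhappy", "Unhappy"]

# The pipeline hands the sparse TF-IDF matrix straight to the classifier,
# which accepts sparse input, so the data is never densified.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

predictions = model.predict(["a wonderful book", "terrible and boring"])
print(predictions)
```

Linear models of this kind accept SciPy sparse input, so memory use stays roughly proportional to the number of non-zero entries rather than to n_samples times n_features.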