Question

I am working on a text classification problem with about 1 million reviews, from which I have to predict sentiment. Here is a glimpse of my dataset:

In [3]:df=pd.read_csv('amazon_review.csv')
       df.head()
Out [3]:    
Review_no   reviewText                                    Sentiment
    1   I enjoy vintage books and movies so I enjoyed ...   Happy
    2   This book is a reissue of an old one; the auth...   Happy
    3   This was a fairly interesting read. It had ol...    Happy
    4   I'd never read any of the Amy Brewster mysteri...   Happy
    5   If you like period pieces - clothing, lingo, y...   Happy

This is the code I am using:

In [10]: X_train, X_test, y_train, y_test = train_test_split(X, y, 
         test_size=0.33)
In[11]: y_train = y_train.map({"Happy": 1, "Content" : 2, "Unhappy" : 3 })
         y_test = y_test.map({"Happy": 1, "Content" : 2, "Unhappy" : 3 })
In [12]: y_train = to_categorical(y_train)
         y_test=to_categorical(y_test)
In [13]: all_text=X_train.append(X_test)
In [14]: from sklearn.feature_extraction.text import TfidfVectorizer

         word_vectorizer = TfidfVectorizer(
           sublinear_tf=True,
           strip_accents='unicode',
           analyzer='word',
           token_pattern=r'\w{1,}',
           stop_words='english',
           ngram_range=(1, 1),
           max_features=20000)
         word_vectorizer.fit(all_text)
         train_word_features = word_vectorizer.transform(X_train)
         test_word_features = word_vectorizer.transform(X_test)
In [15]: train_word_features.shape
Out [15]: (658118,20000)
In [16]: model = Sequential()
         model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2,input_shape= 
         (658118,20000)))
         model.add(Dense(4, activation='softmax'))
In [17]: model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
In [18]: model.fit(train_word_features, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(test_word_features, y_test))

I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-90-46b758e790c5> in <module>()
      2           batch_size=batch_size,
      3           epochs=15,
----> 4           validation_data=(test_word_features, y_test))

~\Anaconda3\lib\site-packages\keras\models.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
    958                               initial_epoch=initial_epoch,
    959                               steps_per_epoch=steps_per_epoch,
--> 960                               validation_steps=validation_steps)
    961 
    962     def evaluate(self, x, y, batch_size=32, verbose=1,

~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1572             class_weight=class_weight,
   1573             check_batch_axis=False,
-> 1574             batch_size=batch_size)
   1575         # Prepare validation data.
   1576         do_validation = False

~\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_batch_axis, batch_size)
   1405                                     self._feed_input_shapes,
   1406                                     check_batch_axis=False,
-> 1407                                     exception_prefix='input')
   1408         y = _standardize_input_data(y, self._feed_output_names,
   1409                                     output_shapes,

~\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
    139                                  ' to have ' + str(len(shapes[i])) +
    140                                  ' dimensions, but got array with shape ' +
--> 141                                  str(array.shape))
    142             for j, (dim, ref_dim) in enumerate(zip(array.shape, shapes[i])):
    143                 if not j and not check_batch_axis:

ValueError: Error when checking input: expected lstm_14_input to have 3 dimensions, but got array with shape (658118, 20000)

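I can't work out where the problem is. How do I change the dimensions? Converting the features to a dense array with .toarray() causes a memory error because the dataset is so large.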

Answer 1

According to the documentation, TfidfVectorizer gives you a sparse matrix:

Returns:
X : sparse matrix, [n_samples, n_features]. Tf-idf-weighted document-term matrix.
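
As a minimal sketch of what that return value looks like, the snippet below vectorizes a few made-up reviews and inspects the result. The toy `reviews` and `sentiments` lists are assumptions for illustration, not the asker's data. It also shows that a scikit-learn estimator which accepts sparse input (here `LogisticRegression`, chosen only as an example) can train on the matrix directly, so no `.toarray()` call is needed on the full dataset.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the review texts and labels (hypothetical data).
reviews = [
    "I enjoyed this vintage mystery",
    "A fairly interesting read",
    "Not my kind of book at all",
    "Boring plot and weak characters",
]
sentiments = ["Happy", "Happy", "Unhappy", "Unhappy"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(reviews)

# The result is a SciPy sparse matrix with two axes: (n_samples, n_features).
# There is no timestep axis, which is what the LSTM error is complaining about.
print(sparse.issparse(features))
print(features.ndim)
print(features.shape)

# Estimators that support sparse input can be fitted without densifying.
clf = LogisticRegression()
clf.fit(features, sentiments)
predictions = clf.predict(features)
print(predictions.shape)
```

If a dense array really is required, slicing a few rows first (for example `features[:2].toarray()`) keeps the memory cost proportional to the batch rather than to the whole dataset.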

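An LSTM works on sequences, so your data has to be in the form of sequences: Keras expects a 3-D input of shape (samples, timesteps, features). Your data is a 2-D sparse matrix of shape (samples, features), and TF-IDF has already discarded the word order, so there is no sequence left for the LSTM to read. I think you should look into word embeddings instead, because feeding TfidfVectorizer's output to an LSTM doesn't make sense. If you want to use TF-IDF, look at the bag-of-words model.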
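
A rough sketch of the word-embedding route, under several assumptions: it uses `tensorflow.keras` with the `TextVectorization` layer (the question uses standalone Keras and does not say which version), toy data in place of the real reviews, and arbitrary values for `max_tokens`, `max_len` and the layer sizes. It is meant to show the shapes involved, not a tuned model.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Toy stand-ins for the reviews and labels (hypothetical data).
texts = [
    "I enjoyed this vintage mystery",
    "A fairly interesting read",
    "It was fine, nothing special",
    "Not my kind of book at all",
]
# Integer class ids starting at 0: Happy=0, Content=1, Unhappy=2.
labels = np.array([0, 0, 1, 2])

max_tokens = 1000  # vocabulary size cap (assumed value)
max_len = 20       # reviews are padded or truncated to this length (assumed value)

# Map each review to a fixed-length sequence of integer word ids.
vectorizer = keras.layers.TextVectorization(
    max_tokens=max_tokens, output_sequence_length=max_len
)
vectorizer.adapt(tf.constant(texts))
sequences = vectorizer(tf.constant(texts)).numpy()  # shape: (n_samples, max_len)

model = keras.Sequential([
    keras.Input(shape=(max_len,), dtype="int64"),
    # Embedding turns (batch, max_len) ids into (batch, max_len, 32) vectors,
    # which is the 3-D input the LSTM expects.
    keras.layers.Embedding(input_dim=max_tokens, output_dim=32),
    keras.layers.LSTM(32),
    keras.layers.Dense(3, activation="softmax"),  # one unit per class
])
model.compile(
    loss="sparse_categorical_crossentropy",  # takes integer labels directly
    optimizer="adam",
    metrics=["accuracy"],
)
model.fit(sequences, labels, epochs=1, batch_size=2, verbose=0)
predictions = model.predict(sequences, verbose=0)
print(predictions.shape)
```

Two choices here differ from the question's code. The labels are numbered from 0, so three output units are enough; mapping the classes to 1, 2, 3 and then calling `to_categorical` yields four columns, one of them always zero. The input is also a small integer array of shape (samples, max_len) rather than a 20000-column matrix, which keeps memory use modest even for a large dataset.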

It may also help to read some blog posts on sentiment analysis; there are plenty of them.
