我正在使用sentiment140数据集来尝试和学习使用RNN进行的情感分析。我在网上找到了使用keras.imdb
数据源的教程,但是我想尝试并使用自己的数据源,因此我尝试将代码修改为我自己的数据。
教程:https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e
数据预处理包括提取序列数据,然后对其进行标记和填充,然后再将其发送到模型进行训练。我在下面的代码中执行了这些操作,但是每当尝试进行培训时,我都会得到if isinstance(data[0], list):IndexError: list index out of range
。我没有定义data
,所以这使我相信我做了keras或tensorflow不喜欢的事情。关于什么导致此错误的任何想法?
我的数据当前为csv文件格式,标题为SENTIMENT
和TEXT
。 SENTIMENT
为0
表示否定,1
为肯定。 TEXT
是已收集的已处理推文。这是一个样本。
数据集CSV(仅视图线可以节省空间)
SENTIMENT,TEXT
0,about to file tax
0,ahh i hate dogs
1,My paycheck came in today
1,lot to do before chi this weekend
1,lol love food
代码
import pandas as pd
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import json
import numpy as np
# Load in DS
df = pd.read_csv('./train.csv')
print(df.head())
#Create sequence
vocabulary_size = 1000
tokenizer = Tokenizer(num_words= vocabulary_size, split=' ')
tokenizer.fit_on_texts(df['TEXT'].values)
X_train = tokenizer.texts_to_sequences(df['TEXT'].values)
#Pad Sequence
X_train = pad_sequences(X_train)
print(X_train)
#Get Sentiment
y_train = df['SENTIMENT'].tolist()
#create model
max_words = 24
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size=32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
batch_size = 64
num_epochs = 3
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]
model.fit(X_train2, y_train2,
validation_data=(X_valid, y_valid),
batch_size=batch_size,
epochs=num_epochs)
输出
Using TensorFlow backend.
SENTIMENT TEXT
0 0 aww that be bummer You shoulda get david carr ...
1 0 be upset that he can not update his facebook b...
2 0 I dive many time for the ball manage to save t...
3 0 my whole body feel itchy and like its on fire
4 0 no it be not behave at all be mad why be here ...
[[ 0 0 0 ... 3 10 5]
[ 0 0 0 ... 46 47 89]
[ 0 0 0 ... 29 9 96]
...
[ 0 0 0 ... 30 309 310]
[ 0 0 0 ... 0 0 72]
[ 0 0 0 ... 33 312 313]]
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 24, 32) 32000
_________________________________________________________________
lstm_1 (LSTM) (None, 100) 53200
_________________________________________________________________
dense_1 (Dense) (None, 1) 101
=================================================================
Total params: 85,301
Trainable params: 85,301
Non-trainable params: 0
_________________________________________________________________
None
Traceback (most recent call last):
File "mcve.py", line 50, in <module>
epochs=num_epochs)
File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py", line 950, in fit
batch_size=batch_size)
File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py", line 787, in _standardize_user_data
exception_prefix='target')
File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training_utils.py", line 79, in standardize_input_data
if isinstance(data[0], list):
IndexError: list index out of range
JUPYTER笔记本错误
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-184505b70981> in <module>()
20 model.fit(X_train2, y_train2,
21 batch_size=batch_size,
---> 22 epochs=num_epochs)
23
~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
948 sample_weight=sample_weight,
949 class_weight=class_weight,
--> 950 batch_size=batch_size)
951 # Prepare validation data.
952 do_validation = False
~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)
785 feed_output_shapes,
786 check_batch_axis=False, # Don't enforce the batch size.
--> 787 exception_prefix='target')
788
789 # Generate sample-wise weight values given the `sample_weight` and
~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
77 'for each key in: ' + str(names))
78 elif isinstance(data, list):
---> 79 if isinstance(data[0], list):
80 data = [np.asarray(d) for d in data]
81 elif len(names) == 1 and isinstance(data[0], (float, int)):
IndexError: list index out of range
答案 0 :(得分:2)
修改
我以前的建议是错误的。我已经检查了您的代码并运行了它,对我来说它没有错误。
然后,我看了源代码standardize_input_data
函数。有一行检查data
参数:
def standardize_input_data(data,
names,
shapes=None,
check_batch_axis=True,
exception_prefix=''):
"""Normalizes inputs and targets provided by users.
Users may pass data as a list of arrays, dictionary of arrays,
or as a single array. We normalize this to an ordered list of
arrays (same order as `names`), while checking that the provided
arrays have shapes that match the network's expectations.
# Arguments
data: User-provided input data (polymorphic).
...
第79行:
elif isinstance(data, list):
if isinstance(data[0], list):
...
因此,看起来如果发生错误,输入数据为list
,但列表的长度为零。
通过调用Model._standardize_user_data(...)在Model.fit(...)方法内部调用standartize_input_data
函数。通过此功能链,传递的data
参数将得到x
参数Model.fit(x, y, ...)
的值。因此,我猜想是X_train2
或X_valid
的类型或内容存在问题。除了X_train2
内容之外,您还会提供X_val
和X_train
吗?
旧的错误建议
我想您应该将词汇量增加一倍,以应对词汇外的记号。
即,更改Embedding
层的初始化:
model.add(Embedding(vocabulary_size + 1, embedding_size, input_length=max_words))
根据docs,“ input_dim:整数>0。词汇量,即最大整数索引+ 1”。
您可以查看最高max(X_train)
的值(已编辑)。
希望有帮助!