I'm trying to train a simple TensorFlow model to detect the sentiment of tweets. The dtypes and shapes of the arrays are consistent, and the model trains fine when recurrent_dropout is set to some float value. However, that disables cuDNN, and I would really like the speed-up (don't we all); yet whenever I remove the recurrent dropout argument, training crashes before the end of the first epoch.
Below is the relevant code; I have omitted the imports and the loading of the CSV files. After the relevant code come the final input dimensions and the error output. I have also figured out why Colab appeared to be cutting down the training data: Colab displays the number of steps after the data is split into batches, so with the default batch size of 32 we get 859 batches (27,481 sequences / 32 ≈ 859). The crash when recurrent dropout is not used is still the issue. As a side note, this code is a very rough draft with the data cleaning all done in the same notebook, hence the lack of typical formatting.
def remove_case(X):
    removed_case = []
    X = X.copy()
    for text in X:
        text = str(text).lower()
        removed_case.append(text)
    X = removed_case
    return X

def remove_hyperlinks(X):
    removed_hyperlinks = []
    X = X.copy()
    for text in X:
        text = str(text)
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'https\S+', '', text)
        text = re.sub(r'www\S+', '', text)
        removed_hyperlinks.append(text)
    X = removed_hyperlinks
    return X

def remove_punctuation(X):
    removed_punc = []
    X = X.copy()
    for text in X:
        text = str(text)
        text = "".join([char for char in text if char not in punctuation])
        removed_punc.append(text)
    X = removed_punc
    return X

def split_text(X):
    split_tweets = []
    X = X.copy()
    for text in X:
        text = str(text).split()
        split_tweets.append(text)
    X = split_tweets
    return X

def map_sentiment(X, l, m, n):
    keys = ['negative', 'neutral', 'positive']
    values = [l, m, n]
    dictionary = dict(zip(keys, values))
    X = X.copy()
    X = X.map(dictionary)
    return X
# def sentiment_to_onehot(X):
#     sentiment_foofs = []
#     X = X.copy()
#     for integer in X:
#         if integer == "negative":    # Negative
#             integer = [1, 0, 0]
#         elif integer == "neutral":   # Neutral
#             integer = [0, 1, 0]
#         elif integer == "positive":  # Positive
#             integer = [0, 0, 1]
#         else:
#             break
#         sentiment_foofs.append(integer)
#     X = sentiment_foofs
#     return X
train_no_punc_lowercase = train.copy()
train_no_punc_lowercase['text'] = remove_case(train_no_punc_lowercase['text'])
train_no_punc_lowercase['text'] = remove_hyperlinks(train_no_punc_lowercase['text'])
train_no_punc_lowercase['text'] = remove_punctuation(train_no_punc_lowercase['text'])
train_no_punc_lowercase['sentiment'] = map_sentiment(train_no_punc_lowercase['sentiment'], 0, 1, 2)
train_no_punc_lowercase.head()
test_no_punc_lowercase = test.copy()
test_no_punc_lowercase['text'] = remove_case(test_no_punc_lowercase['text'])
test_no_punc_lowercase['text'] = remove_hyperlinks(test_no_punc_lowercase['text'])
test_no_punc_lowercase['text'] = remove_punctuation(test_no_punc_lowercase['text'])
test_no_punc_lowercase['sentiment'] = map_sentiment(test_no_punc_lowercase['sentiment'], 0, 1, 2)
features = train.columns.tolist()
features.remove('textID') # all unique, high cardinality feature
features.remove('selected_text') # target
target = 'selected_text'
X_train_no_punc_lowercase = train_no_punc_lowercase[features]
y_train_no_punc_lowercase = train_no_punc_lowercase[target]
X_test_no_punc_lowercase = test_no_punc_lowercase[features]
def stemming_column(df_column):
    ps = PorterStemmer()
    stemmed_word_list = []
    for i, string in enumerate(df_column):
        tokens = word_tokenize(string)
        new_string = ""
        for j, words in enumerate(tokens):
            new_string = new_string + ps.stem(words) + " "
        stemmed_word_list.append(new_string)
    return stemmed_word_list

def create_lookup_table(list1, list2):
    main_list = []
    lookup_dict = {}
    i = 1  # used to create a value in the dictionary
    main_list.append(list1)
    main_list.append(list2)
    for lst in main_list:  # renamed from "list" to avoid shadowing the builtin
        for string in lst:
            for word in string.split():
                if word not in lookup_dict:
                    lookup_dict[word] = i
                    i += 1
    return lookup_dict

def encode(input_list, input_dict):
    encoded_list = []
    for string in input_list:
        sentence_list = []
        for word in string.split():
            sentence_list.append(input_dict[word])  # value lookup from dictionary.. int
        encoded_list.append(sentence_list)
    return encoded_list

def pad_data(list_of_lists):
    padded_data = tf.keras.preprocessing.sequence.pad_sequences(list_of_lists, padding='post')
    return padded_data

def create_array_sentiment_integers(sentiments):  # renamed from "list" to avoid shadowing the builtin
    sent_int_list = []
    for sentiment in sentiments:
        sent_int_list.append(sentiment)
    return np.asarray(sent_int_list, dtype=np.int32)
X_train_stemmed_list = stemming_column(X_train_no_punc_lowercase['text'])
X_test_stemmed_list = stemming_column(X_test_no_punc_lowercase['text'])
lookup_table = create_lookup_table(X_train_stemmed_list, X_test_stemmed_list)
X_train_encoded_list = encode(X_train_stemmed_list, lookup_table)
X_train_padded_data = pad_data(X_train_encoded_list)
Y_train = create_array_sentiment_integers(train_no_punc_lowercase['sentiment'])
max_features = 3 # 3 choices 0, 1, 2
Y_train_final = np.zeros((Y_train.shape[0], max_features), dtype=np.float32)
Y_train_final[np.arange(Y_train.shape[0]), Y_train] = 1.0
input_dimension = len(lookup_table) + 1
output_dimension = 64
input_length = 33
model = Sequential()
model.add(tf.keras.layers.Embedding(input_dim=input_dimension,
                                    output_dim=output_dimension,
                                    input_length=input_length,
                                    mask_zero=True))
model.add(tf.keras.layers.LSTM(512, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(tf.keras.layers.Dense(256, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train_padded_data, Y_train_final, validation_split=0.20, epochs=10)
model.save('Tweet_sentiment.model')
Additionally, here are the shapes of the data:
x train shape: (27481, 33, 1)
x train type: <class 'numpy.ndarray'>
y train shape: (27481, 3)
The error output:
Epoch 1/3
363/859 [===========>..................] - ETA: 9s - loss: 0.5449 - accuracy: 0.5674
---------------------------------------------------------------------------
UnknownError Traceback (most recent call last)
<ipython-input-103-1d4af3962607> in <module>()
----> 1 model.fit(X_train_padded_data, Y_train_final, epochs=3,)
8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
58 ctx.ensure_initialized()
59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
62 if name is not None:
UnknownError: [_Derived_] CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1496): 'cudnnSetRNNDataDescriptor( data_desc.get(), data_type, layout, max_seq_length, batch_size, data_size, seq_lengths_array, (void*)&padding_fill)'
[[{{node cond_38/then/_0/CudnnRNNV3}}]]
[[sequential_5/lstm_4/StatefulPartitionedCall]] [Op:__inference_train_function_36098]
Function call stack:
train_function -> train_function -> train_function
Answer 0 (score: 0)
I see a few issues in your code. They are mentioned below:

You are using input_dimension = len(lookup_table) + 1. len(lookup_table) is just the size of the vocabulary (the number of unique words), and its value will be very high, at least more than 30,000. It is recommended to use only a subset of those words, so you can set input_dimension = 10000 or input_dimension = 15000 (you can experiment with this value) and it should resolve the problem. That said, it will not hurt the accuracy of the model.
Why does setting recurrent_dropout to a float value work? Setting recurrent_dropout makes the LSTM ineligible for the fused cuDNN kernel, so Keras silently falls back to the generic implementation, which copes with the oversized input_dimension instead of crashing.
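For reference, a sketch of that trade-off; the eligibility conditions below are taken from the tf.keras.layers.LSTM documentation for TF 2.x, not from the original answer:

# tf.keras.layers.LSTM uses the fused cuDNN kernel only when all of these hold:
#   activation == 'tanh'
#   recurrent_activation == 'sigmoid'
#   recurrent_dropout == 0
#   unroll == False
#   use_bias == True
#   inputs, if masked, are strictly right-padded
#   eager execution is enabled in the outermost context

# Eligible for the cuDNN fast path:
lstm_fast = tf.keras.layers.LSTM(512, dropout=0.2)

# Falls back to the generic kernel (slower, but sidesteps the cuDNN crash):
lstm_generic = tf.keras.layers.LSTM(512, dropout=0.2, recurrent_dropout=0.2)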
return_sequences=True should be used only when another LSTM layer follows the current LSTM layer. Since you have only one LSTM layer, return_sequences should be set to False.

binary_crossentropy is meant for two-class problems. If you are not one-hot encoding your target, you should use sparse_categorical_crossentropy; since you are one-hot encoding your target (Y_train_final), you should use categorical_crossentropy.
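Putting those two points together, a sketch of the adjusted model definition; it keeps your layer sizes and changes only return_sequences and the loss:

model = Sequential()
model.add(tf.keras.layers.Embedding(input_dim=input_dimension,
                                    output_dim=output_dimension,
                                    input_length=input_length,
                                    mask_zero=True))
# A single LSTM layer, so return only the final hidden state.
model.add(tf.keras.layers.LSTM(512, dropout=0.2))
model.add(tf.keras.layers.Dense(256, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(Dense(3, activation='softmax'))
# The target is one-hot encoded, so categorical_crossentropy is the matching loss.
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])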
Are you sure you want to use masking (mask_zero=True) in the Embedding layer?

Also, I see that you are using many functions and lines of code for data preprocessing, such as removing hyperlinks, removing punctuation, tokenizing, etc. So, I thought I would provide an end-to-end tutorial for text classification that should help you as well as the Stack Overflow community. The code for the same is shown below:
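The answer's code block did not survive in this copy, so what follows is a minimal stand-in sketch of such a pipeline using tf.keras's built-in Tokenizer and pad_sequences. The 'text' and 'sentiment' column names come from the question's data; the vocabulary size, layer sizes, and epochs are illustrative:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

NUM_WORDS = 10000

# Tokenizer lowercases and strips punctuation by default, replacing
# several of the hand-written cleaning functions above.
tokenizer = Tokenizer(num_words=NUM_WORDS, oov_token='<OOV>')
tokenizer.fit_on_texts(train['text'].astype(str))

X = pad_sequences(tokenizer.texts_to_sequences(train['text'].astype(str)),
                  padding='post')
# Integer labels, so sparse_categorical_crossentropy needs no one-hot step.
y = train['sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 2}).values

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=NUM_WORDS, output_dim=64),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X, y, validation_split=0.2, epochs=10)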
For more information, please refer to this beautiful article.

Hope this resolves your issue. Happy learning!