My code seems to throw this error whenever `input_data` is longer than about 4000 entries, but I want to train it on the full 180k-long array. I just finished a text-generation course and am trying to get my model to generate some Eminem lyrics; honestly, the results weren't too bad even using only 5% of all the Eminem words (4k out of 180k).
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import string
import numpy as np
import pandas as pd
# Eminem lyrics https://www.kaggle.com/thaddeussegura/eminem-lyrics-from-all-albums
from urllib.request import urlopen
data = urlopen('https://storage.googleapis.com/kagglesdsdata/datasets/835677/1426970/eminem_lyrics/ALL_eminem.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20200924%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200924T201536Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=9e8afd7dba5915b209e33905c68e93f2bfb1d3baac9456e1a0d16d1b74a0b482baa26bb6f348c2f901b46b63555b1a2bcc900c9db7d17321c27fe4578cc5d12463ca6b3e7c8998cf66a05a33b4b324dba3e48341d010f13a423debb8d1c2f52536870a9cc3ddfa72a4ca9bda874e934bcfdd21512e413e068bbd8c0a2a4042df66358d978080d164ead2f9e0edf1eee4bf66cf2f5c0aa63a5b7e9cea80ca6c211a0558aca9e7671235f105074f5f3f74abb882001acec29573c84b8ed9bf044b7233fb270a12fefe01bd40fe64b44cc0b89d54469357719d14404bb3c6033961c25af43c5c5f9c20fc090cf38fe03946058ecb9b67ebdfe4022c564480a2c73c').read().decode('utf-8')
# split
text = data.split()
# remove punctuation, make all lowercase
dataset = []
import re
for s in text:
    s = re.sub(r'[^\w\s]','',s).lower()
    dataset.append(s)
def tokenize_corpus(corpus, num_words=-1):
    # Fit a Tokenizer on the corpus
    if num_words > -1:
        tokenizer = Tokenizer(num_words=num_words)
    else:
        tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    return tokenizer
# Tokenize the corpus
tokenizer = tokenize_corpus(dataset)
total_words = len(tokenizer.word_index) + 1
print(total_words)
# get inputs and outputs
input_data = []
labels = []
for i in range(180000):
    tokens = np.array(sum(tokenizer.texts_to_sequences(dataset[i:i+11]), []))
    input_data.append(tokens[:-1])
    labels.append(tokens[-1])
input_data = np.array(input_data)
labels = np.array(labels)
#print(input_data)
#print(labels)
# One-hot encode the labels
one_hot_labels = tf.keras.utils.to_categorical(labels, num_classes=total_words)
I have also tried converting `input_data` to a tensor, giving it different dtypes, and so on, which only produces a variety of other errors. But if I change 180000 to anything below 4000, everything works fine.
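One way to check what the array actually looks like (a minimal sketch, assuming `input_data` has already been built by the loop above) is to inspect the per-row lengths:
# Sketch only: if the rows have different lengths, np.array() builds a ragged
# object array, which Keras cannot feed into the Embedding layer.
lengths = {len(row) for row in input_data}
print(lengths)  # anything other than {10} means some windows are not length 10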
If the model cannot handle all 180,000 sequences at once, could I break them into 45 arrays of 4,000 each and train for 5-10 epochs on each one?
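A rough sketch of that chunked-training idea (assuming the model below is already compiled and `input_data` / `one_hot_labels` end up as consistently shaped arrays) might look like this; note that `model.fit` already mini-batches internally via its `batch_size` argument, so this would mainly be about memory:
# Hypothetical chunked training loop (sketch only): fit on 4,000-sample
# slices one after another instead of passing the whole dataset at once.
chunk_size = 4000
for start in range(0, len(input_data), chunk_size):
    end = start + chunk_size
    model.fit(input_data[start:end], one_hot_labels[start:end],
              epochs=5, verbose=1)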
Model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
model = Sequential()
model.add(Embedding(total_words, 64, input_length=10))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(input_data, one_hot_labels, epochs=100, verbose=1)
The last line is what throws the error. Maybe I should change something in the model itself?
The rest of the code is below; only `seed_text` was copied from the course lab:
seed_text = "im feeling chills getting these bills still while having meal"
next_words = 100
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=10, padding='pre')
    predicted_probs = model.predict(token_list)[0]
    predicted = np.random.choice([x for x in range(len(predicted_probs))],
                                 p=predicted_probs)
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)
Please help me fix this error, and if you have any ideas for improving the model as a whole, let me know.
Answer 0 (score: 0)
I found that after roughly 4,000 words the tokenizer, for some reason, starts producing sequences of varying length (not the specified 10), so one more line of code is needed to pad them:
padded = pad_sequences(input_data, maxlen=10, padding="pre")
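Presumably the padded array then replaces `input_data` in the call to `model.fit`, roughly:
# Sketch: use the padded array in place of the ragged input_data
print(padded.shape)  # expected: (num_sequences, 10)
history = model.fit(padded, one_hot_labels, epochs=100, verbose=1)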