ValueError: Failed to convert a NumPy array to a Tensor once the array size exceeds 4000 (Unsupported object type numpy.ndarray)

Date: 2020-09-25 19:33:01

Tags: python numpy tensorflow deep-learning neural-network

My code seems to throw this error whenever 'input_data' is longer than about 4000 entries, but I want to train on an array 180,000 long. I just finished a text-generation class and am trying to get my model to generate some Eminem lyrics; honestly, using only 5% of all the Eminem words (4k out of 180k) wouldn't even be that bad.


import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import string
import numpy as np
import pandas as pd


# Eminem lyrics https://www.kaggle.com/thaddeussegura/eminem-lyrics-from-all-albums

from urllib.request import urlopen

data = urlopen('https://storage.googleapis.com/kagglesdsdata/datasets/835677/1426970/eminem_lyrics/ALL_eminem.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20200924%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200924T201536Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=9e8afd7dba5915b209e33905c68e93f2bfb1d3baac9456e1a0d16d1b74a0b482baa26bb6f348c2f901b46b63555b1a2bcc900c9db7d17321c27fe4578cc5d12463ca6b3e7c8998cf66a05a33b4b324dba3e48341d010f13a423debb8d1c2f52536870a9cc3ddfa72a4ca9bda874e934bcfdd21512e413e068bbd8c0a2a4042df66358d978080d164ead2f9e0edf1eee4bf66cf2f5c0aa63a5b7e9cea80ca6c211a0558aca9e7671235f105074f5f3f74abb882001acec29573c84b8ed9bf044b7233fb270a12fefe01bd40fe64b44cc0b89d54469357719d14404bb3c6033961c25af43c5c5f9c20fc090cf38fe03946058ecb9b67ebdfe4022c564480a2c73c').read().decode('utf-8')


# split
text = data.split()

# remove punctuation, make all lowercase
dataset = []
import re
for s in text:
    s = re.sub(r'[^\w\s]','',s).lower()
    dataset.append(s)


def tokenize_corpus(corpus, num_words=-1):
  # Fit a Tokenizer on the corpus
  if num_words > -1:
    tokenizer = Tokenizer(num_words=num_words)
  else:
    tokenizer = Tokenizer()
  tokenizer.fit_on_texts(corpus)
  return tokenizer

# Tokenize the corpus
tokenizer = tokenize_corpus(dataset)

total_words = len(tokenizer.word_index) + 1
print(total_words)


# get inputs and outputs
input_data = []
labels = []
for i in range(180000):
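    # texts_to_sequences returns one list per word; sum(..., []) flattens the
    # 11-word window into a single sequence (first 10 tokens = input, last = label)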
    tokens = np.array(sum(tokenizer.texts_to_sequences(dataset[i:i+11]), []))
    input_data.append(tokens[:-1])
    labels.append(tokens[-1])

input_data = np.array(input_data)
labels = np.array(labels)

#print(input_data)
#print(labels)

# One-hot encode the labels
one_hot_labels = tf.keras.utils.to_categorical(labels, num_classes=total_words)

I have also tried converting 'input_data' to a tensor, giving it different dtypes, and so on, which only produces various other errors. However, if I change 180000 to anything below 4000, everything works fine.
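A minimal check, not in the original post, to confirm where the shape problem comes from, assuming input_data has been built as above: if any window is not exactly 10 tokens long, np.array(input_data) becomes a ragged object array, which TensorFlow cannot convert to a Tensor.

# count how many token windows deviate from the expected length of 10
lengths = sorted({len(seq) for seq in input_data})
print("distinct window lengths:", lengths)

bad = sum(1 for seq in input_data if len(seq) != 10)
print(f"{bad} of {len(input_data)} windows are not length 10")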

If the model can't handle all 180,000 sequences at once, could I split them into 45 arrays of 4,000 each and train for 5-10 epochs on each?
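For what it's worth, a rough sketch of that chunked idea (my own illustration, assuming the model defined further down has already been compiled and the length issue has been fixed so every chunk is a plain 2-D array): Keras keeps the weights between fit() calls, so training block by block does work, though a single fit() on the full array is simpler since Keras already batches internally.

# train on blocks of 4000 windows, a few epochs per block
chunk_size = 4000
for start in range(0, len(input_data), chunk_size):
    x_chunk = input_data[start:start + chunk_size]
    y_chunk = one_hot_labels[start:start + chunk_size]
    model.fit(x_chunk, y_chunk, epochs=5, verbose=0)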

The model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

model = Sequential()
model.add(Embedding(total_words, 64, input_length=10))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(input_data, one_hot_labels, epochs=100, verbose=1)

The last line is what throws the error; maybe I should change something in the model itself?

The rest is just copied from the class lab, apart from 'seed_text':

seed_text = "im feeling chills getting these bills still while having meal"
next_words = 100
  
for _ in range(next_words):
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=10, padding='pre')
  predicted_probs = model.predict(token_list)[0]
  predicted = np.random.choice([x for x in range(len(predicted_probs))],
                               p=predicted_probs)
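  # look up the word that corresponds to the sampled index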
  output_word = ""
  for word, index in tokenizer.word_index.items():
    if index == predicted:
      output_word = word
      break
  seed_text += " " + output_word
print(seed_text)

Please help me fix this error, and if you have any ideas for improving the model as a whole, let me know.

1 answer:

Answer 0 (score: 0):

I found that after roughly 4,000 words the tokenizer, for some reason, starts producing sequences of varying length (rather than the expected 10), so one more line is needed to pad them:

padded = pad_sequences(input_data, maxlen=10, padding="pre")
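Put in context, the fix slots in right before training, something like this (my placement, not spelled out above):

# pad each ragged token window to exactly 10 entries, then train on the
# rectangular padded array instead of the ragged input_data
padded = pad_sequences(input_data, maxlen=10, padding="pre")
history = model.fit(padded, one_hot_labels, epochs=100, verbose=1)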