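Input: an author's works to train on, plus a seed for prediction
Output: text generated from that seed
I have the raw text, a flat text file with a few thousand lines of text. I want to feed it into an Embedding layer so that Keras vectorizes the data. Here is my text: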
--SNIP
The Wild West\n Ha ha, ride\n All you see is the sun reflectin\' off of the
--SNIP
and I call it input_text:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

num_words = 2000  # keep only the 2000 most frequent words
tok = Tokenizer(num_words)  # tokenizer limited to num_words
tok.fit_on_texts(input_text)  # takes a list of texts to fit on
# put all words from the text into a words list, ordered by rank
# (this is essentially enumerating them)
words = []
for rank in range(1, num_words + 1):
    words += [key for key, value in tok.word_index.items() if value == rank]
# words[:10]
# Tokenizer: class for vectorizing texts and/or turning texts into sequences
# (= list of word indexes, where the word of rank i in the dataset (starting at 1) has index i).
X_train = tok.texts_to_sequences(input_text)  # turns each text into a sequence of word indexes
# pad or truncate every sequence to length 100 (by default Keras pads with 0s at the front)
X_train = sequence.pad_sequences(X_train, maxlen=100)
y_train = words
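It seems that my code takes in the sequences, and then when I apply padding it only gives me the first 100 items of each sequence. How should I split the data up?
Should I take the whole sequence, go through the first 100 words (X), give the next word as the target (Y), and do some skipping along the way?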
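To make the windowing idea concrete, here is a rough sketch of what I mean, in plain Python. The helper name `make_windows`, the `step` parameter, and the toy data are all made up for illustration; none of this is Keras API, and it assumes the whole corpus has already been turned into one long list of word indexes.

```python
def make_windows(token_ids, window=100, step=3):
    """Slice one long list of word indexes into (X, y) next-word pairs."""
    X, y = [], []
    # Stop early enough that every window still has a following word.
    for start in range(0, len(token_ids) - window, step):
        X.append(token_ids[start:start + window])  # `window` consecutive words
        y.append(token_ids[start + window])        # the word right after them
    return X, y


# Toy stand-in for the tokenized corpus: word indexes 1..10.
toy_ids = list(range(1, 11))
X_toy, y_toy = make_windows(toy_ids, window=4, step=2)
print(X_toy)  # [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
print(y_toy)  # [5, 7, 9]
```

With real data I would presumably call this with `window=100` on the full index sequence, and the targets would still need to be one-hot encoded (or used with a sparse categorical loss) to match a softmax output.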
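I want the output to be the probability of each next word, so I have a softmax layer at the end. Basically, I want to generate text from a seed. Is this the right way to go about it? Or is it just better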