Question

I have a dataset that contains questions and answers in the following format:

[question...]?\t[answer...].

Example:

Do you like pizza?     Yes its delicious.
...

Now I want to train a Keras model with it. But when I load the data, I can't convert it into a NumPy array, because the sentences do not all have the same length.

In input_text and out_text I store the questions and answers as lists of tokens, like this:

[["Do", "you", "like", "pizza", "?"] 
 [ ... ]]

(In my code, which is not reproduced here, I also convert the words into vectors with self-made functions.)

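Maybe someone can show how to do this, or link to an example that uses a similar dataset and shows how to load it into a NumPy array for Keras.

Thanks for your replies.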

Answer 1

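I don't know of any tutorial specifically on question answering (QA), but there is a good tutorial on a related problem on the official TensorFlow website.

Since all of our training samples must have the same length, we usually use a padding function to normalize the sequence lengths. For example: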

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

question = ['thi is test', 'this is another test 2']
answers = ['i am answer', 'second answer is here']

# Build one vocabulary over both the questions and the answers
tknizer = Tokenizer()
tknizer.fit_on_texts(question + answers)
# Map each sentence to a list of integer word indices
question = tknizer.texts_to_sequences(question)
answer = tknizer.texts_to_sequences(answers)
# Pad (or truncate) every sequence to exactly 20 entries
question = pad_sequences(question, value=0, padding='post', maxlen=20)
answer = pad_sequences(answer, value=0, padding='post', maxlen=20)

print(question)

Output:

[[4 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [5 1 6 2 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

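In the example above we assume a maximum length of 20. Sequences longer than 20 are truncated to fit that length (by default pad_sequences drops entries from the start of the sequence, since truncating='pre'), while sequences shorter than 20 are padded with 0 at the end.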
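To make the padding and truncation rules concrete, here is a small pure-Python sketch of roughly what happens with padding='post' and the default truncating='pre'. The helper name pad_post is hypothetical and only illustrates the behaviour; it is not how Keras implements pad_sequences, and it returns plain lists rather than a NumPy array.

```python
def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in (hypothetical helper, not the Keras implementation).

    Keeps at most the last `maxlen` items of each sequence (like truncating='pre')
    and appends `value` until the length is `maxlen` (like padding='post').
    """
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Drop items from the front if the sequence is too long
        kept = seq[max(len(seq) - maxlen, 0):]
        # Fill the remaining positions at the end with the padding value
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


short_seq = [4, 1, 2]               # shorter than maxlen: gets padded
long_seq = [1, 2, 3, 4, 5, 6, 7]    # longer than maxlen: gets truncated
result = pad_post([short_seq, long_seq], maxlen=5)
print(result)  # [[4, 1, 2, 0, 0], [3, 4, 5, 6, 7]]
```

Because every row now has the same length, the nested list can be turned into a rectangular array, which is what the original conversion to NumPy was missing.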

Now you can feed the preprocessed data into Keras:

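from tensorflow.keras.layers import Dense, Embedding, Input
from tensorflow.keras.models import Model

# One integer-sequence input per padded array (questions and answers)
inps1 = Input(shape=(20,))
inps2 = Input(shape=(20,))
# Shared embedding: up to 10000 word indices, 100-dimensional vectors
embedding = Embedding(10000, 100)
emb1 = embedding(inps1)
emb2 = embedding(inps2)
# ... rest of the network (it must define prev_layer)
pred = Dense(100, activation='softmax')(prev_layer)
model = Model(inputs=[inps1, inps2], outputs=pred)
model.compile('adam', 'categorical_crossentropy')
# labels is not defined in this snippet; with this loss it needs shape (num_samples, 100)
model.fit([question, answer], labels, batch_size=128, epochs=100)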
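To connect this with the tab-separated file described in the question, one possible way to obtain the two lists of strings that the Tokenizer expects is sketched below. The function name load_qa_pairs and the file name qa.txt are assumptions made for illustration; the sketch also assumes one question/answer pair per line, separated by a single tab.

```python
import io


def load_qa_pairs(lines):
    """Split 'question<TAB>answer' lines into two parallel lists of strings.

    `lines` can be any iterable of strings, for example an open text file.
    Lines without a tab are skipped.
    """
    questions, answers = [], []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # blank or malformed line
        # Split on the first tab only, in case the answer itself contains tabs
        q, a = line.split("\t", 1)
        questions.append(q.strip())
        answers.append(a.strip())
    return questions, answers


# In practice this could be: with open("qa.txt", encoding="utf-8") as f: ...
# (qa.txt is a hypothetical file name); an in-memory sample stands in for it here.
sample = io.StringIO("Do you like pizza?\tYes its delicious.\n")
questions, answers = load_qa_pairs(sample)
print(questions)
print(answers)
```

The resulting lists can then take the place of the hard-coded question and answers lists in the padding example.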

How to load a text dataset (questions and answers) into a NumPy array to train a Keras model
