Question

I want to split my dataset with sklearn, because I don't think validate_split is working for me. This is how I actually read the dataset:

input_sentences = []
output_sentences = []
output_sentences_inputs = []    #Translated data

count = 0
for line in open(r'/content/drive/My Drive/TEMPPP/123.txt', encoding="utf-8"):
    count += 1

    if count > NUM_SENTENCES:
        break

    if '\t' not in line:
        continue

    input_sentence, output = line.rstrip().split('\t')

    output_sentence = output + ' <eos>'
    output_sentence_input = '<sos> ' + output

    input_sentences.append(input_sentence)
    output_sentences.append(output_sentence)
    output_sentences_inputs.append(output_sentence_input)

Now I'm confused about how to use scikit-learn. This is what I have done so far:

from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(input_sentences, output_sentences, test_size = 0.2, random_state = 1)

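First, is this the right approach?

If not, how should I split the data?

If so, please help me clear up this confusion: until now I was passing input_sentences and output_sentences to my layers, so what do I need to pass now? Do I still pass input_sentences and output_sentences as before and train the model on the full dataset, or should I pass only xTrain and yTrain? And are xTest and yTest never passed through the layers, and used only for validation?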

Answer 1

Based on your code, what you are currently doing looks correct.

No, you no longer pass input_sentences and output_sentences. From now on, use only the arrays created by train_test_split.
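One detail the split in the question leaves out is the third list, output_sentences_inputs. train_test_split accepts any number of equal-length sequences and applies the same shuffle to all of them, so the three lists can be split in one call and stay aligned. The following is a minimal sketch under assumptions: the toy sentences are made up for illustration, and the names yInTrain and yInTest are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins (made up) for the three parallel lists built in the question.
input_sentences = ["hello", "thanks", "yes", "no", "good night"]
outputs = ["hola", "gracias", "si", "no", "buenas noches"]
output_sentences = [s + " <eos>" for s in outputs]
output_sentences_inputs = ["<sos> " + s for s in outputs]

# All three lists are shuffled with the same permutation, so the i-th
# element of each returned list still belongs to the same sentence pair.
# yInTrain / yInTest are hypothetical names for the decoder-input split.
(xTrain, xTest,
 yTrain, yTest,
 yInTrain, yInTest) = train_test_split(
    input_sentences, output_sentences, output_sentences_inputs,
    test_size=0.2, random_state=1)

print(len(xTrain), len(xTest))
```

After this, only xTrain, yTrain and yInTrain would go into training, and the three held-out lists would be kept aside for validation.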

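If you are using an ML algorithm that has a fit() method, that method takes xTrain and yTrain. The predict() method takes xTest, and you use yTest to check how accurate the predictions from predict() are, for example by calling sklearn.metrics.r2_score(yTest, predictions).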
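The fit/predict/score flow described above can be sketched as follows. This is an illustration under assumptions, not the asker's model: it uses synthetic numeric data and LinearRegression as a stand-in estimator, because r2_score is a regression metric. A translation model would need the sentences converted to numeric sequences first, and a metric suited to that task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

xTrain, xTest, yTrain, yTest = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = LinearRegression()
model.fit(xTrain, yTrain)            # training sees only the training split
predictions = model.predict(xTest)   # predictions for the held-out inputs

# r2_score expects the true values first, then the predictions.
score = r2_score(yTest, predictions)
print(f"R^2 on held-out data: {score:.3f}")
```

If the model is a Keras model, as validate_split in the question suggests, the held-out arrays can instead be passed to fit() through its validation_data argument once they have been preprocessed in the same way as the training data.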

Also note that ending sentences with multiple question marks makes them sound interrogatory and impolite. You are asking for help here, so please mind your punctuation.

How do I split this dataset into training and validation sets?
