I'm trying to implement 10-fold cross-validation for my machine learning task, but I'm running into trouble with my dataset's feature count.

I have a dataset that contains only binary data, and I'm implementing the 10-fold cross-validation myself: I split the dataset into test and training sets and save them to files.

My actual data file holds sparse vectors. After the split step above, I have 10 test files and 10 training files, each consisting of sparse vectors. The problem is that, because the test and training sets are only subsets of the actual file, the actual data file has a higher feature count than the test and training datasets. So when I try to apply a Spark ML algorithm, I get an error message describing this mismatch.
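My guess (which I haven't confirmed) is that the libsvm reader infers the vector size separately for each file, from the largest feature index it sees in that particular file, so a fold file that never uses the highest-indexed feature comes out one dimension short. A sketch of what I mean, with placeholder file names:

// Hypothetical illustration: each load() infers its own vector size from the
// largest feature index present in that particular file.
Dataset<Row> full  = spark.read().format("libsvm").load("all.libsvm");    // e.g. inferred size 842
Dataset<Row> fold0 = spark.read().format("libsvm").load("test_0.libsvm"); // e.g. inferred size 841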
import java.util.ArrayList;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

int counter = 10; // fold count
double accuracySum = 0, precisionSum = 0, recallSum = 0;

// filePath is the path of the file holding the sparse vectors of my actual
// dataset. i and fileName are only used for information messages: I repeat
// these steps X times and average the results to get a more reliable
// estimate; i is the counter for X.
ArrayList<ArrayList<Dataset<Row>>> datasets =
        splitAccordingTo10FoldCrossValidation(filePath, i, fileName, sparkBase);

for (int k = 0; k < counter; k++) {
    // Each row of datasets holds the test data at index 0 and the training
    // data at index 1: index 0 is one fold (the test set) and index 1 is the
    // union of the remaining folds (the training set) for this iteration.
    // datasets has size 10, equal to the cross-validation fold count.
    Dataset<Row> testData = datasets.get(k).get(0);
    Dataset<Row> trainingData = datasets.get(k).get(1);

    StringIndexerModel labelIndexer = new StringIndexer()
            .setInputCol("label")
            .setOutputCol("indexedLabel")
            .fit(trainingData);
    VectorIndexerModel featureIndexer = new VectorIndexer()
            .setInputCol("features")
            .setOutputCol("indexedFeatures")
            .setMaxCategories(4)
            .fit(trainingData);
    DecisionTreeClassifier dt = new DecisionTreeClassifier()
            .setLabelCol("indexedLabel")
            .setFeaturesCol("indexedFeatures");
    IndexToString labelConverter = new IndexToString()
            .setInputCol("prediction")
            .setOutputCol("predictedLabel")
            .setLabels(labelIndexer.labels());
    Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter});

    // Note: this statement is a no-op; the DataFrameReader it builds is never used.
    trainingData.sparkSession().read().format("libsvm");

    PipelineModel model = pipeline.fit(trainingData);
    // The error quoted at the end of this question is thrown on the next line.
    Dataset<Row> predictions = model.transform(testData);

    MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
            .setLabelCol("indexedLabel")
            .setPredictionCol("prediction")
            .setMetricName("accuracy");
    // These sums are divided by the fold count (10) after the loop.
    accuracySum += evaluator.evaluate(predictions);
    evaluator.setMetricName("weightedPrecision");
    precisionSum += evaluator.evaluate(predictions);
    evaluator.setMetricName("weightedRecall");
    recallSum += evaluator.evaluate(predictions);
}
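For completeness, the division mentioned in the comment above happens after the loop, where I average the accumulated metrics over the folds:

// Average the per-fold metrics over the 10 folds.
double avgAccuracy  = accuracySum  / counter;
double avgPrecision = precisionSum / counter;
double avgRecall    = recallSum    / counter;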
After splitting the dataset I save the folds to separate files, so for each iteration the test and training datasets live in different data files: 10 test files and 10 training files of sparse vectors in total. This is how I read them:
Dataset<Row> data = sparkBase
.getSpark()
.read()
.format("libsvm")
.load(filePath);
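One thing I considered: Spark's libsvm data source accepts a numFeatures option that pins the vector length instead of inferring it from each file's largest feature index. A sketch of what that would look like (842 is the size from the error message below), though I'm not sure this is the right way to address the mismatch:

Dataset<Row> data = sparkBase
        .getSpark()
        .read()
        .format("libsvm")
        .option("numFeatures", "842") // pin the vector size for every fold file
        .load(filePath);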
The part where I compute the predictions throws "java.lang.AssertionError: assertion failed: VectorIndexerModel expected vector of length 842 but found length 841". How can I fix this?