Question

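I'm working on a project where I have to do some K-means clustering with Spark's MLlib. The problem is that my data has 744 features. I did some research and found that PCA is what I need. The best part is that Spark ships a PCA implementation, so I decided to use it.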

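import java.util.LinkedList;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import scala.Tuple2;

        // parsedTrainingData (presumably a JavaRDD<Vector>) and jsc (a JavaSparkContext)
        // come from earlier code that is not shown here.
        // Copy the 381 rows of 744 features into a local array.
        double[][] array = new double[381][744];
        int contor = 0;
        for (Vector vectorData : parsedTrainingData.collect()) {
            // Assign first, then advance, so row 0 is filled and the index stays in bounds.
            array[contor++] = vectorData.toArray();
        }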

        LinkedList<Vector> rowsList = new LinkedList<>();
        for (int i = 0; i < array.length; i++) {
            Vector currentRow = Vectors.dense(array[i]);
            rowsList.add(currentRow);
        }
        JavaRDD<Vector> rows = jsc.parallelize(rowsList);

        // Create a RowMatrix from JavaRDD<Vector>.
        RowMatrix mat = new RowMatrix(rows.rdd());

        // Compute the top principal components and their explained variance;
        // *param* stands for the number of components to keep.
        Tuple2<Matrix, Vector> pc = mat.computePrincipalComponentsAndExplainedVariance(*param*);

        RowMatrix projected = mat.multiply(pc._1);
        Vector[] collectPartitions = (Vector[]) projected.rows().collect();
        System.out.println("Projected vector of principal component:");
        for (Vector vector : collectPartitions) {
            System.out.println("\t" + vector);
        }
        System.out.println("\n Explained variance:");
        System.out.println(pc._2);

        // Sum the explained variance of the retained components.
        double sum = 0;
        for (double val : pc._2.toArray()) {
            sum += val;
        }
        System.out.println("\n the sum is: " + sum);

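About the data I want to apply PCA to: I have 744 features, which represent values collected every hour by sensors in a home (the total number of seconds of activity). That gives 31 sensors * 24 hours, named in the format s(sensorNumber)(hour): s10, s11 ..... s123, s20, s21 .... s223, ..... s3123.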
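To make the layout concrete, here is a minimal sketch of how the 744 columns would be ordered, assuming sensors are numbered 1 to 31, hours 0 to 23, and features are stored sensor by sensor. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: class and method names are made up for illustration.
class SensorFeatureLayout {
    static final int SENSORS = 31; // sensors in the home
    static final int HOURS = 24;   // hourly readings, hours 0..23

    // Column index for a 1-based sensor number and a 0-based hour,
    // assuming features are ordered sensor by sensor, hour by hour.
    static int columnIndex(int sensor, int hour) {
        return (sensor - 1) * HOURS + hour;
    }

    // Feature names in column order: s10, s11, ..., s123, s20, ..., s3123.
    static List<String> featureNames() {
        List<String> names = new ArrayList<>(SENSORS * HOURS);
        for (int sensor = 1; sensor <= SENSORS; sensor++) {
            for (int hour = 0; hour < HOURS; hour++) {
                names.add("s" + sensor + hour);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = featureNames();
        System.out.println(names.size() + " features, first " + names.get(0)
                + ", last " + names.get(names.size() - 1));
    }
}
```

Note that names built this way are not unique on their own: s123 could be read as sensor 1 at hour 23 or sensor 12 at hour 3, so the column index is the safer key.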

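From what I understand, one criterion for reducing the dimensionality without losing most of the information is that the sum of the explained variance should be greater than 0.9 (90%). After some tests I got this result: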

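So, from my understanding, it is safe to reduce my 744-feature vectors to 100-feature vectors. My concern is that this result looks suspiciously good. I searched for examples to guide me, but I'm still not sure that what I did is correct. Is a result like this plausible?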
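The 90% criterion can be sketched as a small helper that walks the explained variance values and stops at the first cumulative sum that reaches the threshold. This assumes the array holds per-component variance ratios in descending order, which is how I understand the second element returned by computePrincipalComponentsAndExplainedVariance. The class name, method name, and the toy values in main are hypothetical and are not my real results.

```java
// Hypothetical helper: names and sample values are made up for illustration.
class ExplainedVarianceCutoff {

    // Returns the smallest k such that the first k ratios sum to at least
    // threshold, or -1 if the threshold is never reached.
    static int componentsFor(double[] ratios, double threshold) {
        double cumulative = 0.0;
        for (int i = 0; i < ratios.length; i++) {
            cumulative += ratios[i];
            if (cumulative >= threshold) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Toy ratios, not taken from the sensor data.
        double[] ratios = {0.5, 0.25, 0.125, 0.0625};
        System.out.println("components needed for 90%: " + componentsFor(ratios, 0.9));
    }
}
```

With the Spark code above, the input would be pc._2.toArray(). A return value of -1 means that the requested number of components does not reach the threshold and a larger value is needed.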

Spark Principal Component Analysis (PCA) expected results

0 answers: