Duplicate column when creating an IndexedRowMatrix with Spark

Time: 2017-09-03 09:29:07

Tags: java apache-spark rdd

I need to compute pairwise similarities between several documents. To do that, I proceed as follows:

    // Read every file as one (path, content) pair.
    JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
    System.out.println(files.count()+"**");
    // Strip punctuation and digits from the content.
    JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
        return RowFactory.create(t._1,t._2.replaceAll("[^\\w\\s]+","").replaceAll("\\d", ""));
    });
    StructType schema = new StructType(new StructField[]{
            new StructField("id", DataTypes.StringType, false, Metadata.empty()),
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
    Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);

    Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
    Dataset<Row> tokenized_rows = tokenizer.transform(rows);

    StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
    Dataset<Row> filtred_rows = remover.transform(tokenized_rows);

    // Term counts, then TF-IDF weights in the "features" column.
    CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
    Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
    IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
    IDFModel idfModel = idf.fit(verct_rows);
    Dataset<Row> rescaledData = idfModel.transform(verct_rows);

    JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
        // Column 0 is the file path; the row index is the number in the file name.
        String s = r.getAs(0);
        int index = Integer.parseInt(s.replace(s.substring(0,24),"").replace(s.substring(s.indexOf(".txt")),""));
        // Column 5 is "features" (a spark.ml vector); convert it to a spark.mllib vector.
        SparseVector sparse = (SparseVector) r.getAs(5);
        org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
        return new IndexedRow(index, vec);
    });

    System.out.println(vrdd.count()+"---");
    IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
    System.out.println(mat.numCols()+"---"+mat.numRows());

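Unfortunately, the result shows that the IndexedRowMatrix is created with 4 columns (the first one is duplicated), even though my dataset contains only 3 documents.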

3**
3--
1106---4

Could you help me find the reason for this duplication?

1 Answer:

Answer 0: (score: 0)

Most likely there is no duplication at all; your data simply does not follow the specification, which requires row indices to be consecutive, zero-based integers. Therefore numRows is:

max(row.index for row in rows) + 1
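
To make the rule concrete, here is a minimal plain-Java sketch that models the row count without using Spark at all. The class name `RowIndexSketch`, both helper methods, and the file names 1.txt, 2.txt, 3.txt are assumptions made for illustration; the question does not show the actual file names. Under that assumption the parsed indices are 1, 2, 3, so the largest index is 3 and the matrix reports 4 rows, with row 0 implicitly empty.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of the row-count rule; it does not call Spark.
final class RowIndexSketch {

    // Mirrors the rule above: numRows = max(row index) + 1.
    static long numRows(long[] rowIndices) {
        long max = -1L; // an empty input yields 0 rows in this sketch
        for (long index : rowIndices) {
            max = Math.max(max, index);
        }
        return max + 1;
    }

    // Maps arbitrary indices to 0..n-1, preserving their sorted order.
    static Map<Long, Long> toZeroBased(long[] rowIndices) {
        long[] sorted = rowIndices.clone();
        Arrays.sort(sorted);
        Map<Long, Long> mapping = new LinkedHashMap<>();
        for (long index : sorted) {
            mapping.putIfAbsent(index, (long) mapping.size());
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Hypothetical indices parsed from 1.txt, 2.txt, 3.txt.
        long[] fromFileNames = {1, 2, 3};
        System.out.println(numRows(fromFileNames)); // 4

        // After remapping to 0, 1, 2 the count matches the number of documents.
        long[] remapped = toZeroBased(fromFileNames).values().stream()
                .mapToLong(Long::longValue).toArray();
        System.out.println(numRows(remapped)); // 3
    }
}
```

In the Spark job itself, one possible fix is to subtract the offset when parsing the index, or to derive indices with `zipWithIndex` on the RDD and keep a separate mapping back to the file names.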