我需要计算几个文档之间的成对相似性。为此,我推荐如下:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
return RowFactory.create(t._1,t._2.replaceAll("[^\\w\\s]+","").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
//DenseVector dense;
String s = r.getAs(0);
int index = new Integer(s.replace(s.substring(0,24),"").replace(s.substring(s.indexOf(".txt")),""));
SparseVector sparse = (SparseVector) r.getAs(5);
//dense = sparse.toDense();parseVector) r.getAs(5);
org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
不幸的是,结果表明,即使我的数据集包含3个文档,IndexedRowMatrix也会创建4列(第一列是重复的)。
3**
3--
1106---4
你能帮助我找出这种重复的原因吗?
答案 0 :(得分:0)
很可能根本没有重复,你的数据根本不遵循规范,这要求索引是连续的,从零开始,整数。因此SELECT product, inventory FROM
InventoryEvents.std:groupwin(name).win:time_length_batch(1 min, 1000)
.std:win:lastevent();
为numRows
max(row.index for row in rows) + 1