I want to retrieve similar documents from a collection. Sample text is provided below:
car killed cat
Train killed cat
john plays cricket
tom like mangoes
I want "car killed cat" and "Train killed cat" to be identified as similar documents.
I have used the following code to tokenize the text, remove stop words, and compute TF-IDF:
// TOKENIZE DATA
RegexTokenizer regexTokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("words")
.setPattern("\\W");
DataFrame tokenized = regexTokenizer.transform(trainingRiskData);
// REMOVE STOP WORDS
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered");
DataFrame stopWordsRemoved = remover.transform(tokenized);
// COMPUTE TERM FREQUENCY USING HASHING
int numFeatures = 50; // the feature vectors shown below have 50 buckets
HashingTF hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
.setNumFeatures(numFeatures);
DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(rawFeaturizedData);
DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
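As a sanity check on the output below: Spark's IDF uses the smoothed formula ln((N + 1) / (df + 1)), where N is the number of documents and df is the number of documents containing the term. A minimal plain-Java sketch (the class name IdfSketch and the hard-coded counts are mine, for illustration only):

```java
// Sketch of Spark MLlib's smoothed IDF: ln((N + 1) / (df + 1)).
// N = total documents, df = documents containing the term.
public class IdfSketch {
    public static double idf(long numDocs, long docFreq) {
        return Math.log((double) (numDocs + 1) / (docFreq + 1));
    }

    public static void main(String[] args) {
        int n = 4; // four sample documents
        System.out.println(idf(n, 2)); // term in 2 of 4 docs -> ~0.5108
        System.out.println(idf(n, 1)); // term in 1 of 4 docs -> ~0.9163
        System.out.println(idf(n, 3)); // term in 3 of 4 docs -> ~0.2231
    }
}
```

With N = 4 these give ln(5/3) ≈ 0.5108, ln(5/2) ≈ 0.9163 and ln(5/4) ≈ 0.2231, matching the weights in the features column of the data frame below.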
This is what my final data frame looks like:
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
|id |text |words |filtered |rawFeatures |features |
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
|1 |car killed cat |[car, killed, cat] |[car, killed, cat] |(50,[10,12,13],[1.0,1.0,1.0])|(50,[10,12,13],[0.9162907318741551,0.5108256237659907,0.22314355131420976])|
|2 |Train killed cat |[train, killed, cat] |[train, killed, cat] |(50,[12,13,42],[1.0,1.0,1.0])|(50,[12,13,42],[0.5108256237659907,0.22314355131420976,0.9162907318741551])|
|3 |john plays cricket|[john, plays, cricket]|[john, plays, cricket]|(50,[1,5,13],[1.0,1.0,1.0]) |(50,[1,5,13],[0.5108256237659907,0.9162907318741551,0.22314355131420976]) |
|4 |tom like mangoes |[tom, like, mangoes] |[tom, like, mangoes] |(50,[1,18,26],[1.0,1.0,1.0]) |(50,[1,18,26],[0.5108256237659907,0.9162907318741551,0.9162907318741551]) |
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
I learned from the link below that I can compute cosine similarity to find the similarity between two vectors:
https://github.com/goldshtn/spark-workshop/blob/master/python/lab7-plagiarism.md
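For reference, cosine similarity between two sparse TF-IDF vectors is the dot product over their shared indices divided by the product of their norms. A plain-Java sketch over (index, value) arrays in the same layout as the features column above (class and method names are mine, not from the linked lab):

```java
import java.util.HashMap;
import java.util.Map;

// Cosine similarity for sparse vectors stored as parallel (index, value)
// arrays, the same layout as the "features" column above.
public class CosineSketch {
    public static double cosine(int[] ia, double[] va, int[] ib, double[] vb) {
        Map<Integer, Double> a = new HashMap<>();
        for (int k = 0; k < ia.length; k++) a.put(ia[k], va[k]);

        double dot = 0.0;
        for (int k = 0; k < ib.length; k++) {
            Double x = a.get(ib[k]);
            if (x != null) dot += x * vb[k]; // only shared indices contribute
        }
        return dot / (norm(va) * norm(vb));
    }

    private static double norm(double[] v) {
        double s = 0.0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // Rows 1 and 2 of the data frame share indices 12 and 13:
        int[] i1 = {10, 12, 13};
        double[] v1 = {0.9162907318741551, 0.5108256237659907, 0.22314355131420976};
        int[] i2 = {12, 13, 42};
        double[] v2 = {0.5108256237659907, 0.22314355131420976, 0.9162907318741551};
        System.out.println(cosine(i1, v1, i2, v2)); // roughly 0.27
    }
}
```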
My requirement is different: I want to identify similar documents within a set of documents, not just compare two vectors. I would like to know whether the following solution helps.
Using the code below, I converted this data frame to a RowMatrix and called cosineSimilarity; I then get a CoordinateMatrix as the result:
JavaRDD<Vector> tempRDD = featurizedData.select("text", "features").toJavaRDD()
.map(new Function<Row, Vector>() {
@Override
public Vector call(Row arg0) throws Exception {
org.apache.spark.mllib.linalg.Vector v = (org.apache.spark.mllib.linalg.Vector) arg0.get(1);
return v;
}
});
RowMatrix rowMatrix = new RowMatrix(tempRDD.rdd());
CoordinateMatrix matchingData = rowMatrix.columnSimilarities(0.8);
A CoordinateMatrix is a collection of MatrixEntry objects. Here is the resulting CoordinateMatrix:
MatrixEntry(10,13,0.5773502691896257)
MatrixEntry(12,42,0.7071067811865476)
MatrixEntry(1,13,0.408248290463863)
MatrixEntry(1,18,0.7071067811865476)
MatrixEntry(1,5,0.7071067811865476)
MatrixEntry(18,26,1.0)
MatrixEntry(5,13,0.5773502691896257)
MatrixEntry(1,26,0.7071067811865476)
MatrixEntry(12,13,0.816496580927726)
MatrixEntry(10,12,0.7071067811865476)
How should I interpret this matrix?
Please also let me know if my approach is completely wrong.
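For clarity, the document-to-document comparison I am after would look like this brute-force sketch in plain Java (the dense arrays, the made-up weights and the 0.25 threshold are just for illustration; a real solution would have to stay in Spark for a large corpus):

```java
import java.util.ArrayList;
import java.util.List;

// Brute-force all-pairs document similarity over dense TF-IDF vectors.
public class PairwiseSketch {
    public static List<int[]> similarPairs(double[][] docs, double threshold) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            for (int j = i + 1; j < docs.length; j++) {
                if (cosine(docs[i], docs[j]) >= threshold) {
                    pairs.add(new int[]{i, j}); // documents i and j are similar
                }
            }
        }
        return pairs;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            na += a[k] * a[k];
            nb += b[k] * b[k];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[][] docs = {
            {0.92, 0.51, 0.22, 0.00, 0.00}, // "car killed cat" (made-up weights)
            {0.00, 0.51, 0.22, 0.92, 0.00}, // "Train killed cat"
            {0.00, 0.00, 0.00, 0.00, 1.00}, // an unrelated document
        };
        for (int[] p : similarPairs(docs, 0.25)) {
            System.out.println(p[0] + " ~ " + p[1]); // prints "0 ~ 1"
        }
    }
}
```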