Identifying similar documents with Apache Spark

Asked: 2017-11-30 06:52:35

Tags: apache-spark apache-spark-mllib tf-idf apache-spark-ml cosine-similarity

I want to find similar documents within a collection.

Sample texts are provided below:

car killed cat                 
Train killed cat                
john plays cricket             
tom like mangoes 

I would like "car killed cat" and "Train killed cat" to be identified as similar documents.

I have used the code below to tokenize the text, remove stop words, and compute TF-IDF features:
    // TOKENIZE DATA
    RegexTokenizer regexTokenizer = new RegexTokenizer()
            .setInputCol("text")
            .setOutputCol("words")
            .setPattern("\\W");

    DataFrame tokenized = regexTokenizer.transform(trainingRiskData);

    // REMOVE STOP WORDS
    StopWordsRemover remover = new StopWordsRemover()
            .setInputCol("words")
            .setOutputCol("filtered");

    DataFrame stopWordsRemoved = remover.transform(tokenized);

    // COMPUTE TERM FREQUENCY USING HASHING
    int numFeatures = 50; // matches the 50-dimensional vectors shown in the output below
    HashingTF hashingTF = new HashingTF()
            .setInputCol("filtered")
            .setOutputCol("rawFeatures")
            .setNumFeatures(numFeatures);

    DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);

    // SCALE TERM FREQUENCIES BY INVERSE DOCUMENT FREQUENCY
    IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
    IDFModel idfModel = idf.fit(rawFeaturizedData);

    DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
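
To print the transformed DataFrame without truncating the vector columns (presumably how the table below was produced), a call like this can be used:

    featurizedData.show(false);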

This is what my final DataFrame looks like:

+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
|id |text              |words                 |filtered              |rawFeatures                  |features                                                                   |
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
|1  |car killed cat    |[car, killed, cat]    |[car, killed, cat]    |(50,[10,12,13],[1.0,1.0,1.0])|(50,[10,12,13],[0.9162907318741551,0.5108256237659907,0.22314355131420976])|
|2  |Train killed cat  |[train, killed, cat]  |[train, killed, cat]  |(50,[12,13,42],[1.0,1.0,1.0])|(50,[12,13,42],[0.5108256237659907,0.22314355131420976,0.9162907318741551])|
|3  |john plays cricket|[john, plays, cricket]|[john, plays, cricket]|(50,[1,5,13],[1.0,1.0,1.0])  |(50,[1,5,13],[0.5108256237659907,0.9162907318741551,0.22314355131420976])  |
|4  |tom like mangoes  |[tom, like, mangoes]  |[tom, like, mangoes]  |(50,[1,18,26],[1.0,1.0,1.0]) |(50,[1,18,26],[0.5108256237659907,0.9162907318741551,0.9162907318741551])  |
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
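
Each cell in the rawFeatures and features columns is Spark's printed form of a sparse vector, (size, [indices], [values]). For example, the first row's rawFeatures cell is equivalent to (a sketch assuming org.apache.spark.mllib.linalg.Vectors):

    // A vector of size 50 with value 1.0 at hashed term indices 10, 12 and 13
    Vector v = Vectors.sparse(50, new int[]{10, 12, 13}, new double[]{1.0, 1.0, 1.0});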

From the link below, I understand that I can compute the cosine similarity between two vectors to measure how similar they are:

https://github.com/goldshtn/spark-workshop/blob/master/python/lab7-plagiarism.md
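
For reference, a minimal sketch of the standard cosine similarity formula between two equally sized arrays (my own illustration, not code from the linked lab):

    // Cosine similarity: dot(a, b) / (||a|| * ||b||)
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }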

My requirement is different: I want to identify similar documents within a document collection.

I want to know whether the following example would help:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

Using the code below, I converted this DataFrame to a RowMatrix and called columnSimilarities, which gives me a CoordinateMatrix as the result:

    JavaRDD<Vector> tempRDD = featurizedData.select("text", "features").toJavaRDD()
            .map(new Function<Row, Vector>() {
                @Override
                public Vector call(Row row) throws Exception {
                    // Column index 1 is "features", the TF-IDF vector
                    // (Vector here is org.apache.spark.mllib.linalg.Vector)
                    return (Vector) row.get(1);
                }
            });

    RowMatrix rowMatrix = new RowMatrix(tempRDD.rdd());

    // 0.8 is the DIMSUM sampling threshold; similarities below it
    // may be estimated with lower accuracy
    CoordinateMatrix matchingData = rowMatrix.columnSimilarities(0.8);

A CoordinateMatrix is a distributed collection of MatrixEntry objects.
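
A minimal sketch for inspecting it, assuming the result is small enough to collect to the driver:

    // Each MatrixEntry carries a row index i, a column index j and the similarity value
    for (MatrixEntry entry : matchingData.entries().toJavaRDD().collect()) {
        System.out.println(entry.i() + "," + entry.j() + " -> " + entry.value());
    }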

Here is the CoordinateMatrix I get:

MatrixEntry(10,13,0.5773502691896257)
MatrixEntry(12,42,0.7071067811865476)
MatrixEntry(1,13,0.408248290463863)
MatrixEntry(1,18,0.7071067811865476)
MatrixEntry(1,5,0.7071067811865476)
MatrixEntry(18,26,1.0)
MatrixEntry(5,13,0.5773502691896257)
MatrixEntry(1,26,0.7071067811865476)
MatrixEntry(12,13,0.816496580927726)
MatrixEntry(10,12,0.7071067811865476)

How do I interpret this matrix?
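
If I read the RowMatrix docs correctly, columnSimilarities() compares the columns of the matrix, i.e. the 50 hashed feature dimensions, not its rows, which would explain why the indices above look like feature positions rather than document ids. A hedged sketch of one possible workaround (not verified), transposing the matrix so that documents become columns, reusing tempRDD from above:

    // Attach a row index to each document vector, then transpose the matrix
    // so that documents become columns; columnSimilarities() then compares
    // documents instead of hashed feature dimensions.
    JavaRDD<IndexedRow> indexedRows = tempRDD.zipWithIndex()
            .map(t -> new IndexedRow(t._2(), t._1()));

    IndexedRowMatrix indexedMatrix = new IndexedRowMatrix(indexedRows.rdd());
    CoordinateMatrix docSimilarities = indexedMatrix.toCoordinateMatrix()
            .transpose()
            .toRowMatrix()
            .columnSimilarities();
    // Each MatrixEntry(i, j, value) should now be the cosine similarity
    // of documents i and j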

Please let me know if my approach is completely incorrect.
