Question

I want to compare two texts in Scala and compute a similarity rate. I started coding, but I'm stuck:

import org.apache.spark._
import org.apache.spark.SparkContext._


object WordCount {

    def main(args: Array[String]):Unit = {
       val white = "/whiteCat.txt" // "The white cat is eating a white soup"
       val black  = "/blackCat.txt" // "The black cat is eating a white sandwich"
       val conf = new SparkConf().setAppName("wordCount")
       val sc = new SparkContext(conf)
       val b =  sc.textFile(white)
       val words = b.flatMap(line => line.split("\\W+"))
       val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
       counts.take(10).foreach(println)
       //counts.saveAsTextFile(outputFile)
       }    
    }

I managed to split each text into words and count the occurrences of each word. For example, for file 1 I get:

(The,1)
(white,2)
(cat,1)
(is,1)
(eating,1)
(a,1)
(soup,1)

To compute the similarity rate, I need to implement the following algorithm, but I have no experience with Scala:

i=0
foreach word in the first text
   j = 0
   IF keyFile1[i] == keyFile2[j]
       THEN MIN(valueFile1[i], valueFile2[j]) / MAX(valueFile1[i], valueFile2[j])
   ELSE j++
   i++

Can you help me?

Answer 1

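You can use leftOuterJoin to join the two key/value pair RDDs, which produces an RDD of type RDD[(String, (Int, Option[Int]))]. From each tuple, collect the two counts, flatten them to Int, and apply your min/max formula, as in the following example: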

val wordCountsWhite = sc.textFile("/path/to/whitecat.txt").
  flatMap(_.split("\\W+")).
  map((_, 1)).
  reduceByKey(_ + _)

wordCountsWhite.collect
// res1: Array[(String, Int)] = Array(
//   (is,1), (eating,1), (cat,1), (white,2), (The,1), (soup,1), (a,1)
// )

val wordCountsBlack = sc.textFile("/path/to/blackcat.txt").
  flatMap(_.split("\\W+")).
  map((_, 1)).
  reduceByKey(_ + _)

wordCountsBlack.collect
// res2: Array[(String, Int)] = Array(
//   (is,1), (eating,1), (cat,1), (white,1), (The,1), (a,1), (sandwich,1), (black,1)
// )

val similarityRDD = wordCountsWhite.leftOuterJoin(wordCountsBlack).map{
  case (k: String, (c1: Int, c2: Option[Int])) => {
    val counts = Seq(Some(c1), c2).flatten
    (k, counts.min.toDouble / counts.max )
  }
}

similarityRDD.collect
// res4: Array[(String, Double)] = Array(
//   (is,1.0), (eating,1.0), (cat,1.0), (white,0.5), (The,1.0), (soup,1.0), (a,1.0)
// )
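
Note that in the output above a word found only in the first text (soup) gets a ratio of 1.0, because only one count survives the flatten, and words found only in the second text are dropped by the left join. If you would rather count such words as 0 and reduce everything to a single rate, here is a minimal sketch on plain Scala collections rather than RDDs. The names OverallSimilarity, perWord and overall are made up for illustration, and averaging the per-word ratios is an assumption, not something the question specifies. On RDDs, fullOuterJoin should give you the same union of keys.

```scala
object OverallSimilarity {
  // Count word occurrences in a text, splitting on non-word characters.
  def wordCounts(text: String): Map[String, Int] =
    text.split("\\W+").filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.length) }

  // Per-word min/max ratio over the union of both vocabularies.
  // A word missing from one text counts as 0 there, so its ratio is 0.0.
  def perWord(a: Map[String, Int], b: Map[String, Int]): Map[String, Double] =
    (a.keySet ++ b.keySet).map { word =>
      val x = a.getOrElse(word, 0)
      val y = b.getOrElse(word, 0)
      // max(x, y) >= 1 because the word occurs in at least one text.
      word -> math.min(x, y).toDouble / math.max(x, y)
    }.toMap

  // Assumed overall rate: the mean of the per-word ratios.
  // Two empty texts are treated as identical (1.0) by convention.
  def overall(a: Map[String, Int], b: Map[String, Int]): Double = {
    val ratios = perWord(a, b).values
    if (ratios.isEmpty) 1.0 else ratios.sum / ratios.size
  }
}
```

With the two sentences from the question, six words are shared (one of them with ratio 0.5) and three appear in only one text, so the mean is 5.5 / 9.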

Answer 2

This seems easy to do with a for comprehension:

 for( a <- counts1; b <- counts2 if a._1==b._1 ) yield Math.min(a._2,b._2)/Math.max(a._2,b._2)

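Edit: Sorry, the code above does not work. Here is a revised version that still uses a for comprehension. counts1 and counts2 are the two word-count RDDs from the question.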

val result= for( (t1,t2) <- counts1.cartesian(counts2) if( t1._1==t2._1)) yield Math.min(t1._2,t2._2).toDouble/Math.max(t1._2,t2._2).toDouble

Result:

result.foreach(println)
1.0
0.5
1.0
1.0
1.0
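
Two things are worth keeping in mind with this approach. First, dividing two Int values truncates, which is why the revised version calls toDouble. Second, cartesian compares every pair of entries, so the work grows with the product of the two vocabulary sizes, and a join on the key is usually cheaper. The following is a small sketch of the same for comprehension on plain Scala sequences instead of RDDs; ForComprehensionSketch and ratios are illustrative names, not part of the answer.

```scala
object ForComprehensionSketch {
  type Counts = Seq[(String, Int)]

  // Pair up entries whose keys match and compute min/max for each pair.
  // toDouble avoids integer division, where e.g. 1 / 2 evaluates to 0.
  def ratios(counts1: Counts, counts2: Counts): Seq[(String, Double)] =
    for {
      (word1, c1) <- counts1
      (word2, c2) <- counts2
      if word1 == word2
    } yield (word1, math.min(c1, c2).toDouble / math.max(c1, c2))
}
```

Keeping the word in the result, rather than only the ratio, makes the output easier to read once the order of the entries is no longer obvious.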

Answer 3

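There are many algorithms for measuring the similarity between strings. One of them is edit distance. Edit distance has several definitions, each with its own set of allowed operations, but the main idea is to find the shortest sequence of operations (insertion, deletion, substitution) that transforms string a into string b.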
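
As an illustration of the idea, here is a minimal sketch of the Levenshtein variant of edit distance (insertions, deletions and substitutions, each costing 1), using the usual dynamic-programming table reduced to two rows. The similarity function, which normalises the distance by the longer string's length, is an assumption added for convenience, not a standard definition.

```scala
object EditDistance {
  // Levenshtein distance between a and b, keeping only the previous table row.
  def levenshtein(a: String, b: String): Int = {
    // Row 0: turning the empty prefix of a into b's first j characters takes j insertions.
    var prev = Array.tabulate(b.length + 1)(identity)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // deleting the first i characters of a
      for (j <- 1 to b.length) {
        val substitution = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(prev(j) + 1, curr(j - 1) + 1), // deletion, insertion
          prev(j - 1) + substitution              // match or substitution
        )
      }
      prev = curr
    }
    prev(b.length)
  }

  // Assumed normalisation to [0, 1]: 1.0 means identical strings.
  def similarity(a: String, b: String): Double = {
    val longest = math.max(a.length, b.length)
    if (longest == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / longest
  }
}
```

For example, EditDistance.levenshtein("kitten", "sitting") is 3: two substitutions and one insertion.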

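Levenshtein distance and Longest Common Subsequence are well-known algorithms for measuring the similarity between strings. However, these methods are not sensitive to context. For that reason, you may want to look at this article, which presents n-gram similarity and distance. You can also find Scala implementations of these algorithms on GitHub or Rosetta Code.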
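
To give a feel for n-gram based measures, here is a small sketch that compares the sets of character n-grams of two strings using the Jaccard index (size of the intersection divided by size of the union). This is one simple n-gram measure chosen for illustration and is not necessarily the definition used in the article mentioned above; NGramSimilarity, ngrams and jaccard are hypothetical names.

```scala
object NGramSimilarity {
  // Set of character n-grams of s. A non-empty string shorter than n
  // contributes itself as its only gram; the empty string has none.
  def ngrams(s: String, n: Int): Set[String] = {
    require(n > 0, "n must be positive")
    if (s.isEmpty) Set.empty[String] else s.sliding(n).toSet
  }

  // Jaccard index over the two n-gram sets, in [0, 1].
  // Two strings without any n-grams are treated as identical (1.0).
  def jaccard(a: String, b: String, n: Int = 2): Double = {
    val gramsA = ngrams(a, n)
    val gramsB = ngrams(b, n)
    val union = gramsA union gramsB
    if (union.isEmpty) 1.0
    else (gramsA intersect gramsB).size.toDouble / union.size
  }
}
```

For "night" and "nacht" with n = 2, the only shared bigram is "ht" out of seven distinct bigrams, so the result is 1/7.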

I hope this helps!
