I want to compare two texts in Scala and compute a similarity rate. I have the code below, but I haven't managed to compute the average inside the for-comprehension. I'm new to Scala and I don't know how to do this.
import org.apache.spark._
import org.apache.spark.SparkContext._
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // File contents: "The white cat is eating a white soup"
    val wordCounts1 = sc.textFile("/chatblanc.txt").
      flatMap(_.split("\\W+")).
      map((_, 1)).
      reduceByKey(_ + _)
    wordCounts1.collect.foreach(println)
    // Res:
    // (is,1)
    // (eating,1)
    // (cat,1)
    // (white,2)
    // (The,1)
    // (soup,1)
    // (a,1)
    print("======= End first file ========\n")
    // File contents: "The black cat is eating a white sandwich"
    val wordCounts2 = sc.textFile("/chatnoir.txt").
      flatMap(_.split("\\W+")).
      map((_, 1)).
      reduceByKey(_ + _)
    wordCounts2.collect.foreach(println)
    // Res2:
    // (is,1)
    // (eating,1)
    // (cat,1)
    // (white,1)
    // (The,1)
    // (a,1)
    // (sandwich,1)
    // (black,1)
    print("======= End second file ========\n")
    print("======= Display similarity rate ========\n")
    // For every word common to both files: min(count1, count2) / max(count1, count2)
    val result = for ((t1, t2) <- wordCounts1.cartesian(wordCounts2) if t1._1 == t2._1)
      yield Math.min(t1._2, t2._2).toDouble / Math.max(t1._2, t2._2).toDouble
    result.collect.foreach(println)
    // Res:
    // 1.0
    // 1.0
    // 1.0
    // 0.5
    // 1.0
    // 1.0
  }
}
In the end, what we want is to store the average of these six values in a variable. Can you help me?
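For context, the averaging step being asked about maps directly onto Spark's built-in actions on an RDD[Double]; a minimal sketch (the variable names avg and avgByHand are illustrative, not from the original code):

// mean() is provided by DoubleRDDFunctions via an implicit conversion.
val avg = result.mean()
// Equivalent, computed by hand from two separate actions:
val avgByHand = result.sum() / result.count()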
Answer 0 (score: 4)
There is no need for a Cartesian product here. Just join:
// Count words, using Double counts so the per-word ratio below is fractional.
val rdd1 = sc.parallelize(Seq("The white cat is eating a white soup"))
  .flatMap(_.split("\\s+").map((_, 1.0)))
  .reduceByKey(_ + _)
val rdd2 = sc.parallelize(Seq("The black cat is eating a white sandwich"))
  .flatMap(_.split("\\s+").map((_, 1.0)))
  .reduceByKey(_ + _)
// join keeps only keys present in both RDDs; each shared word gets
// the ratio of its smaller count to its larger count.
val comb = rdd1.join(rdd2).mapValues { case (x, y) => x.min(y) / x.max(y) }
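Only words that appear in both sentences survive the join. As a rough illustration for these two inputs (the ordering of collect is not guaranteed):

comb.collect().foreach(println)
// e.g. (The,1.0), (cat,1.0), (is,1.0), (eating,1.0), (a,1.0), (white,0.5)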
Now, if you want the average, take the values and call mean:
comb.values.mean
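To store it in a variable, as the question asks (the name similarity is illustrative), this averages the six per-word ratios:

val similarity: Double = comb.values.mean()
println(similarity) // (5 * 1.0 + 0.5) / 6 ≈ 0.9167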