I want to compare two texts in Scala and compute a similarity rate. I have the code below, but I haven't managed to compute the average inside the for-comprehension. I'm new to Scala and I don't know how to do this.
import org.apache.spark._
import org.apache.spark.SparkContext._
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // File contents: "The white cat is eating a white soup"
    val wordCounts1 = sc.textFile("/chatblanc.txt").
      flatMap(_.split("\\W+")).
      map((_, 1)).
      reduceByKey(_ + _)
    wordCounts1.collect.foreach(println)
    // Res:
    // (is,1)
    // (eating,1)
    // (cat,1)
    // (white,2)
    // (The,1)
    // (soup,1)
    // (a,1)
    print("======= End first file ========\n")
    // File contents: "The black cat is eating a white sandwich"
    val wordCounts2 = sc.textFile("/chatnoir.txt").
      flatMap(_.split("\\W+")).
      map((_, 1)).
      reduceByKey(_ + _)
    wordCounts2.collect.foreach(println)
    // Res2:
    // (is,1)
    // (eating,1)
    // (cat,1)
    // (white,1)
    // (The,1)
    // (a,1)
    // (sandwich,1)
    // (black,1)
    print("======= End second file ========\n")
    print("======= Display similarity rate ========\n")
    // For every word common to both files: min(count1, count2) / max(count1, count2)
    val result = for ((t1, t2) <- wordCounts1.cartesian(wordCounts2) if t1._1 == t2._1)
      yield Math.min(t1._2, t2._2).toDouble / Math.max(t1._2, t2._2).toDouble
    result.collect.foreach(println)
    // Res:
    // 1.0
    // 1.0
    // 1.0
    // 0.5
    // 1.0
    // 1.0
  }
}
In the end, what we want is to store the average of these six values in a variable. Can you help me?
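For context, the averaging step being asked about maps directly onto Spark's built-in actions on an RDD[Double]; a minimal sketch (the variable names avg and avgByHand are illustrative, not from the original code):

// mean() is provided by DoubleRDDFunctions via an implicit conversion.
val avg = result.mean()
// Equivalent, computed by hand from two separate actions:
val avgByHand = result.sum() / result.count()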
Answer 0 (score: 4)
There is no need for a Cartesian product here. Just join:
// Count words, using Double counts so the per-word ratio below is fractional.
val rdd1 = sc.parallelize(Seq("The white cat is eating a white soup"))
  .flatMap(_.split("\\s+").map((_, 1.0)))
  .reduceByKey(_ + _)
val rdd2 = sc.parallelize(Seq("The black cat is eating a white sandwich"))
  .flatMap(_.split("\\s+").map((_, 1.0)))
  .reduceByKey(_ + _)
// join keeps only keys present in both RDDs; each shared word gets
// the ratio of its smaller count to its larger count.
val comb = rdd1.join(rdd2).mapValues { case (x, y) => x.min(y) / x.max(y) }
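Only words that appear in both sentences survive the join. As a rough illustration for these two inputs (the ordering of collect is not guaranteed):

comb.collect().foreach(println)
// e.g. (The,1.0), (cat,1.0), (is,1.0), (eating,1.0), (a,1.0), (white,0.5)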
Now, if you want the average, take the values and call mean:
comb.values.mean
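To store it in a variable, as the question asks (the name similarity is illustrative), this averages the six per-word ratios:

val similarity: Double = comb.values.mean()
println(similarity) // (5 * 1.0 + 0.5) / 6 ≈ 0.9167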