Spark

Date: 2015-10-14 19:22:13

Tags: scala apache-spark

I have a list of values together with the aggregated length of all their occurrences, stored as an array.

For example, if my sentence is

"I have a cat. The cat looks very cute"

my array looks like

Array((I,1), (have,4), (a,1), (cat,6), (The, 3), (looks, 5), (very ,4), (cute,4))

Now I want to calculate the average length of each word, i.e. the aggregated length divided by the number of occurrences.
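In other words, each word needs both its aggregated length and its occurrence count before that division can happen. As a rough, illustrative sketch only (assuming a SparkContext named sc and plain whitespace splitting, which is not exactly how the array above was produced):

// Sketch: pair every word with (its length, one occurrence),
// sum both per word, then divide to get the average length.
val sentence = "I have a cat. The cat looks very cute"
val avgPerWord = sc.parallelize(sentence.split("\\s+").toSeq)
  .map(w => (w, (w.length, 1)))
  .reduceByKey { case ((l1, c1), (l2, c2)) => (l1 + l2, c1 + c2) }
  .mapValues { case (totalLen, count) => totalLen.toDouble / count }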

I tried to code this in Scala as follows:

val avglen = arr.reduceByKey( (x,y) => (x, y.toDouble / x.size.toDouble) )

I get the following error on x.size:

error: value size is not a member of Int

Please help me figure out where I am going wrong.

Regards, VRK

3 Answers:

Answer 0 (score: 0)

If I understand the question correctly:

import org.apache.spark.rdd.RDD

val rdd: RDD[(String, Int)] = ???
val ave: RDD[(String, Double)] =
     rdd.map { case (name, numOccurance) => 
       (name, name.length.toDouble / numOccurance)
     }
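A quick usage sketch against the sample data (assuming a live SparkContext named sc; the printed values are only indicative):

val arr = Array(("I", 1), ("have", 4), ("a", 1), ("cat", 6),
                ("The", 3), ("looks", 5), ("very", 4), ("cute", 4))
val rdd = sc.parallelize(arr)
val ave = rdd.map { case (name, numOccurance) =>
  (name, name.length.toDouble / numOccurance)
}
ave.collect().foreach(println)  // e.g. (I,1.0), (have,1.0), ..., (cat,0.5), ...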

Answer 1 (score: 0)

This is a slightly confusing question. If your data is already in an Array[(String, Int)] collection (for instance after a driver-side collect()), then you do not need any RDD transformations at all. In fact, there is a nice trick with fold*() to get the average of a collection:

val average = arr.foldLeft(0.0) { case (sum: Double, (_, count: Int)) => sum + count } /
              arr.foldLeft(0.0) { case (sum: Double, (word: String, count: Int)) => sum + count / word.length }

It is a bit long-winded, but it essentially sums the total number of characters in the numerator and the number of words in the denominator. Running your example, I see the following:
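Concretely, with the sample data the first fold adds up 1 + 4 + 1 + 6 + 3 + 5 + 4 + 4 = 28 total characters, and the second adds up 1 + 1 + 1 + 2 + 1 + 1 + 1 + 1 = 9 word occurrences, so the average comes out to 28 / 9 ≈ 3.111, as in the REPL session below.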

scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))

scala> val average = ...
average: Double = 3.111111111111111

If your (String, Int) tuples are instead distributed in an RDD[(String, Int)], you can solve this fairly easily using accumulators:

val chars = sc.accumulator(0.0)
val words = sc.accumulator(0.0)
wordsRDD.foreach { case (word: String, count: Int) =>
  chars += count; words += count / word.length
}

val average = chars.value / words.value

Running the example above (placed in an RDD this time), I see the following:

scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))

scala> val wordsRDD = sc.parallelize(arr)
wordsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:14

scala> val chars = sc.accumulator(0.0)
chars: org.apache.spark.Accumulator[Double] = 0.0

scala> val words = sc.accumulator(0.0)
words: org.apache.spark.Accumulator[Double] = 0.0

scala> wordsRDD.foreach { case (word: String, count: Int) =>
     |   chars += count; words += count / word.length
     | }
...
scala>     val average = chars.value / words.value
average: Double = 3.111111111111111

Answer 2 (score: 0)

After your comment, I think I understand what you want:

val words = sc.parallelize(Array(("i", 1), ("have", 4), 
                                 ("a", 1), ("cat", 6), 
                                 ("the", 3), ("looks", 5), 
                                 ("very", 4), ("cute", 4)))

val avgs = words.map { case (word, count) => (word, count / word.length.toDouble) }

println("My averages are: ")
avgs.take(100).foreach(println)
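With the sample data, the printed per-word averages (count divided by word length) should come out roughly as:

(i,1.0)
(have,1.0)
(a,1.0)
(cat,2.0)
(the,1.0)
(looks,1.0)
(very,1.0)
(cute,1.0)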


Suppose instead you have a paragraph containing these words and you want to compute the average word length over the whole paragraph.

Using a map-reduce approach with spark-1.5.1, this takes two steps:

val words = sc.parallelize(Array(("i", 1), ("have", 4), 
                                 ("a", 1), ("cat", 6), 
                                 ("the", 3), ("looks", 5), 
                                 ("very", 4), ("cute", 4)))

val wordCount = words.map { case (word, count) => count}.reduce((a, b) => a + b)
val wordLength = words.map { case (word, count) => word.length * count}.reduce((a, b) => a + b)

println("The avg length is: " +  wordLength / wordCount.toDouble)

I ran this code in an .ipynb notebook connected to the spark-kernel.
