Scala: counting the results of a groupBy

Date: 2015-04-07 14:09:51

Tags: scala group-by rdd

I am new to Scala and have the following problem:

Suppose I have parsed a text file into the following format:

RDD:[A , Apple, a fruit]
    [A , America, a country]
    [A , Africa, a continent]
    [B , Brazil, a country]

I want to groupBy the letter so that I end up with something like:

[A, Apple, a fruit, America, a country, Africa, a continent]

Then I want to take this result and groupBy on itself, so that for each letter it gives me every string together with its count:

[A, a, 3
    fruit, 1
    America, 1
    Africa, 1]

So far my code looks like this:

val joined = parsed.groupBy(_._2).map((x)=>(x._1,x._2.map((y)=> y._1).toSeq))
val listtext = joined.map(x => x._2)
val idfs = listtext.groupBy(l => l).map(t => (t._1, t._2.size))

But the result is wrong: it seems to treat each element of listtext as a single string, so the count is always 1.
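
One way to see why every count comes out as 1, sketched with plain Scala collections instead of an RDD: `groupBy(l => l)` keys on each whole sequence, so words that sit inside different sequences are never compared with each other. The sample data and the (letter, word, description) tuple layout are assumptions, since the actual shape of `parsed` is not shown.

```scala
// Assumed shape of `parsed`: (letter, word, description) tuples.
val parsed = Seq(
  ("A", "Apple", "a fruit"),
  ("A", "America", "a country"),
  ("A", "Africa", "a continent"),
  ("B", "Brazil", "a country")
)

// One sequence of text fragments per letter.
val listtext: Seq[Seq[String]] =
  parsed.groupBy(_._1).values.map(_.flatMap(t => Seq(t._2, t._3))).toSeq

// Grouping by identity uses each whole sequence as the key,
// so every distinct sequence is counted exactly once.
val wrong: Map[Seq[String], Int] =
  listtext.groupBy(l => l).map { case (k, v) => (k, v.size) }

// Splitting into individual words before grouping yields per-word counts
// (here across all letters, just to show the difference).
val wordCounts: Map[String, Int] =
  listtext.flatten
    .flatMap(_.split(" ").toSeq)
    .groupBy(w => w)
    .map { case (w, ws) => (w, ws.size) }
```

The same idea carries over to an RDD: the elements have to be broken into words before the second groupBy, otherwise the key is the entire sequence.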

2 answers:

Answer 0: (score: 2)

If I understand your question correctly, you can do it like this:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  val conf = new SparkConf().setMaster("local[2]").setAppName("test")
  val sc = new SparkContext(conf)

  case class Input(letter: Char, text: String)
  case class Count(word: String, cnt: Long)
  case class Output(letter: Char, words: Seq[Count])

  val input: RDD[Input] = sc.parallelize(Seq(
    Input('A', "Apple, a fruit"),
    Input('A', "America, a country"),
    Input('A', "Africa, a continent"),
    Input('B', "Brazil, a country")
  ))

  // Group rows by letter, then count the words in each group's text.
  val output: RDD[Output] = input.groupBy(_.letter) map { case (letter, in) =>
    val cnt = collection.mutable.Map.empty[String, Long]
    // Strip commas, split on spaces, and tally each word.
    in.flatMap(_.text.replaceAll(",", "").split(" ").toSeq).foreach { word =>
      cnt.put(word, cnt.getOrElse(word, 0L) + 1)
    }
    val words = cnt map { case (word, n) => Count(word, n) }
    // Most frequent words first.
    Output(letter, words.toSeq.sortBy(_.cnt).reverse)
  }

  output.collect().foreach(println)

Output:

Output(B,ArrayBuffer(Count(a,1), Count(country,1), Count(Brazil,1)))
Output(A,ArrayBuffer(Count(a,3), Count(Apple,1), Count(Africa,1), Count(country,1), Count(continent,1), Count(fruit,1), Count(America,1)))

Answer 1: (score: 1)

You can simply chain a few transformations/combinators to achieve this:

val arr = Array(("A" , "Apple", "a fruit"),
                ("A" , "America", "a country"),
                ("A" , "Africa", "a continent"),
                ("B" , "Brazil", "a country"))

val rdd = sc.parallelize(arr)

val r1 = rdd.groupBy(x => x._1)
// A => ((A, Apple, a fruit), (A, America, a country), ...), B => ...

val r2 = r1.map{ case (x, y) => (x, y.flatMap{ case (a, b, c) => Array(b, c) }
                                     .mkString(" ")) }
// A => "Apple a fruit America a country Africa a continent", B => ...

r2.map { case (x, y) => (x, y.split(" ")
                            .groupBy(x => x)
                            .map{ case (a, b) => (a, b.size) })}
  .collect

// Array((A, Map(fruit -> 1, a -> 3, country -> 1, Apple -> 1, continent -> 1, America -> 1, Africa -> 1)), (B,Map(Brazil -> 1, country -> 1, a -> 1)))
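
As a possible variation on the chain above, the counting can also be done with `reduceByKey` on ((letter, word), 1) pairs, which avoids building one joined string per letter. This is only a sketch: it assumes a Spark version where pair-RDD operations are available without extra imports, a local master, and the same sample tuples; the app name and the `byLetter` helper value are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("letter-word-count")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(Seq(
  ("A", "Apple", "a fruit"),
  ("A", "America", "a country"),
  ("A", "Africa", "a continent"),
  ("B", "Brazil", "a country")
))

// Emit one ((letter, word), 1) pair per word, then sum the ones per key.
val counts = rdd
  .flatMap { case (letter, name, desc) =>
    (name +: desc.split(" ").toSeq).map(w => ((letter, w), 1))
  }
  .reduceByKey(_ + _)

// Regroup by letter on the driver, for display only (the result is small here).
val byLetter: Map[String, Map[String, Int]] =
  counts.collect().groupBy(_._1._1).map { case (letter, kvs) =>
    letter -> kvs.map { case ((_, w), n) => w -> n }.toMap
  }

byLetter.foreach(println)
sc.stop()
```

The per-word sums are computed on the cluster, and only the final (letter, word, count) triples are collected.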