I'm new to Scala and have the following problem.
Suppose I have parsed a text file into the following format:
RDD:[A , Apple, a fruit]
[A , America, a country]
[A , Africa, a continent]
[B , Brazil, a country]
I want to groupBy the letter so that I end up with something like:
[A, Apple, a fruit, America, a country, Africa a continent]
Then I want to take that result and groupBy the words themselves, so that for each letter I get every string together with its count:
[A, a, 3
fruit, 1
America, 1
Africa, 1]
So far my code looks like this:
val joined = parsed.groupBy(_._2).map((x)=>(x._1,x._2.map((y)=> y._1).toSeq))
val listtext = joined.map(x => x._2)
val idfs = listtext.groupBy(l => l).map(t => (t._1, t._2.size))
But the result is wrong: it seems to treat listtext as a single string, so the count is always 1.
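One likely cause can be sketched on plain Scala collections instead of an RDD: `groupBy(l => l)` on a collection whose elements are whole sequences uses each sequence as the key, so every distinct sequence is counted once. Flattening to individual words before grouping gives per-word counts. The sample values and the names `rows`, `perSeq` and `perWord` are made up for illustration, since the real shape of `parsed` is not shown in the question.

```scala
object GroupByPitfall {
  // Hypothetical stand-in for `listtext`: each element is a whole sequence of words.
  val rows: List[Seq[String]] = List(
    Seq("Apple", "a", "fruit"),
    Seq("America", "a", "country"),
    Seq("Africa", "a", "continent")
  )

  // Grouping by the element itself keys on the whole Seq, so each count is 1.
  val perSeq: Map[Seq[String], Int] =
    rows.groupBy(identity).map { case (k, v) => (k, v.size) }

  // Flatten to single words first, then group: now "a" is counted three times.
  val perWord: Map[String, Int] =
    rows.flatten.groupBy(identity).map { case (k, v) => (k, v.size) }

  def main(args: Array[String]): Unit = {
    println(perSeq)
    println(perWord)
  }
}
```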
Answer 0 (score: 2)
If I understand your question correctly, you could do something like this:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val sc = new SparkContext(conf)

case class Input(letter: Char, text: String)
case class Count(word: String, cnt: Long)
case class Output(letter: Char, words: Seq[Count])

val input: RDD[Input] = sc.parallelize(Seq(
  Input('A', "Apple, a fruit"),
  Input('A', "America, a country"),
  Input('A', "Africa, a continent"),
  Input('B', "Brazil, a country")
))

// Group the rows by letter, then count the words within each group.
val output: RDD[Output] = input.groupBy(_.letter) map { case (letter, in) =>
  val cnt = collection.mutable.Map.empty[String, Long]
  // Strip commas, split on spaces, and tally each word.
  in.flatMap(_.text.replaceAll(",", "").split(" ").toSeq).foreach { word =>
    cnt.put(word, cnt.getOrElse(word, 0L) + 1)
  }
  val words = cnt map { case (word, n) => Count(word, n) }
  // Most frequent words first.
  Output(letter, words.toSeq.sortBy(_.cnt).reverse)
}

output.collect().foreach(println)
Output:
Output(B,ArrayBuffer(Count(a,1), Count(country,1), Count(Brazil,1)))
Output(A,ArrayBuffer(Count(a,3), Count(Apple,1), Count(Africa,1), Count(country,1), Count(continent,1), Count(fruit,1), Count(America,1)))
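The counting inside the `map` does not depend on Spark, so it can be pulled out into a small pure function and checked on plain collections. The sketch below is one way to do that, using an immutable `groupBy` instead of the mutable map. The name `countWords` is hypothetical and not part of the answer, and splitting on runs of commas and whitespace is an assumption about the desired tokenization.

```scala
object WordCounts {
  // Count words across several text fragments, most frequent first.
  // Ties are broken by the word's natural String ordering so the result is deterministic.
  def countWords(texts: Iterable[String]): Seq[(String, Long)] =
    texts.toSeq
      .flatMap(_.split("[,\\s]+"))   // split on runs of commas and whitespace
      .filter(_.nonEmpty)            // drop empty tokens caused by leading separators
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size.toLong) }
      .toSeq
      .sortBy { case (word, n) => (-n, word) }

  def main(args: Array[String]): Unit =
    countWords(Seq("Apple, a fruit", "America, a country", "Africa, a continent"))
      .foreach(println)
}
```

Inside the Spark job, a helper like this could stand in for the mutable map, for example `Output(letter, countWords(in.map(_.text)).map { case (w, n) => Count(w, n) })`.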
Answer 1 (score: 1)
You can simply chain a few transformations/combinators to achieve this:
val arr = Array(("A", "Apple", "a fruit"),
                ("A", "America", "a country"),
                ("A", "Africa", "a continent"),
                ("B", "Brazil", "a country"))
val rdd = sc.parallelize(arr)
val r1 = rdd.groupBy(x => x._1)
// A => ((A, Apple, a fruit), (A, America, a country), ...), B => ...
val r2 = r1.map { case (x, y) =>
  (x, y.flatMap { case (a, b, c) => Array(b, c) }.mkString(" "))
}
// A => "Apple a fruit America a country Africa a continent", B => ...
r2.map { case (x, y) =>
  (x, y.split(" ").groupBy(w => w).map { case (a, b) => (a, b.size) })
}.collect
// Array((A, Map(fruit -> 1, a -> 3, country -> 1, Apple -> 1, continent -> 1, America -> 1, Africa -> 1)), (B,Map(Brazil -> 1, country -> 1, a -> 1)))