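I have a text variable that is an RDD of String in Scala:
val data = sc.parallelize(List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good."))
I have another variable, a Scala Map (shown below):
// list of words whose doc count is needed; the initial doc count is 1
val dictionary = Map( """good""" -> 1,"""working""" -> 1,"""posting""" -> 1 )
I want to compute the document count for each dictionary term and get the output in key-value format.
For the data above, my output should look like this: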
(good,2)
(working,1)
(posting,1)
Here is what I tried:
dictionary.map { case(k,v) => k -> k.r.findFirstIn(data.map(line => line.trim()).collect().mkString(",")).size}
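I get a count of 1 for every word.
Please help me fix this.
Thanks in advance.
Answer 0 (score: 1)
Why not use flatMap to build the dictionary, which you can then query?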
val dictionary = data.flatMap {case line => line.split(" ")}.map {case word => (word, 1)}.reduceByKey(_+_)
If I collect this in the REPL, I get the following result:
res9: Array[(String, Int)] = Array((here,1), (good.,1), (good,2), (here.,1), (You,1), (working,1), (today.You,1), (boy.Are,1), (are,2), (a,2), (posting,1), (i,1), (boy.,1), (also,1), (I,1), (am,2), (you,1))
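Obviously you need a better split than the one in my simple example: splitting on spaces leaves punctuation attached (good., today.You) and treats You and you as different words.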
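One possible way to split more carefully is sketched below. The helper name `tokenize` is hypothetical and not part of the answer, and it assumes plain ASCII English text. It returns a Set so that each word counts at most once per line, which turns the word count into a document count. The sample lines are held in a local List here so the sketch runs without Spark.

```scala
// Hypothetical helper (not from the answer): lower-case the line, split on
// runs of non-letters, and keep each word once per line.
def tokenize(line: String): Set[String] =
  line.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSet

// The three sample lines from the question, as a local List for illustration.
val lines = List(
  "i am a good boy.Are you a good boy.",
  "You are also working here.",
  "I am posting here today.You are good."
)

// Each line contributes a word at most once, so the group size is the
// number of lines (documents) containing that word.
val docCounts: Map[String, Int] =
  lines.flatMap(tokenize).groupBy(identity).map { case (w, ws) => w -> ws.size }
```

On the RDD from the question, the same helper should slot into the pipeline as `data.flatMap(tokenize).map((_, 1)).reduceByKey(_ + _)`.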
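Answer 1 (score: 1)
First, your dictionary should be a Set, because in general you need to map a set of terms to the number of documents that contain them.
So your data should look like this: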
scala> val docs = List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good.")
docs: List[String] = List(i am a good boy.Are you a good boy., You are also working here., I am posting here today.You are good.)
Your dictionary should look like this:
scala> val dictionary = Set("good", "working", "posting")
dictionary: scala.collection.immutable.Set[String] = Set(good, working, posting)
Then you have to implement your transformation. The simplest logic would use the contains function:
scala> dictionary.map(k => k -> docs.count(_.contains(k))).toMap
res4: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
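For a better solution, I suggest you implement a specific function of type (String, String) => Boolean, according to your requirements, that determines whether the term is present in the document: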
scala> def foo(doc: String, term: String): Boolean = doc.contains(term)
foo: (doc: String, term: String)Boolean
The final solution then looks like this:
scala> dictionary.map(k => k -> docs.count(d => foo(d, k))).toMap
res3: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
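Because `contains` is a plain substring test, a term such as good would also match inside a longer word like goodness. The sketch below shows one possible stricter predicate. The name `containsTerm` is hypothetical, and whole-word, case-insensitive matching is an assumption about the requirements rather than something the question states.

```scala
import java.util.regex.Pattern

// Hypothetical replacement for foo: true only when `term` occurs in `doc`
// as a whole word, ignoring case. Pattern.quote keeps any regex
// metacharacters in the term literal.
def containsTerm(doc: String, term: String): Boolean =
  Pattern
    .compile("\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE)
    .matcher(doc)
    .find()

// Same sample data as in the answer.
val docs = List(
  "i am a good boy.Are you a good boy.",
  "You are also working here.",
  "I am posting here today.You are good."
)
val dictionary = Set("good", "working", "posting")

// Document count per term, using the stricter predicate.
val termCounts: Map[String, Int] =
  dictionary.map(k => k -> docs.count(d => containsTerm(d, k))).toMap
```

For the sample data this gives the same counts as the `contains` version; the two only diverge when a term appears inside another word or in a different case.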
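The last thing to do is compute the resulting map with the SparkContext. First you have to decide which data to parallelize. Assuming we want to parallelize the collection of documents, the solution could look like this: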
val docsRDD = sc.parallelize(List(
"i am a good boy.Are you a good boy.",
"You are also working here.",
"I am posting here today.You are good."
))
// merge two term -> count maps, summing the counts of keys present in both
def merge(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] =
  m1 ++ m2.map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) }
// emit term -> 1 for every dictionary term found in a document, then sum the maps
docsRDD.map(doc => dictionary.collect {
  case term if doc.contains(term) => term -> 1
}.toMap).reduce(merge)
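The reduce step can be checked locally, without Spark, by folding the per-document maps over a plain List. This is only a sketch: `mergeCounts` is a hypothetical name for a merge written with foldLeft so that keys present in only one of the two maps keep their count unchanged, and `perDoc` stands in for what each RDD element would hold after the map stage.

```scala
// Hypothetical merge: add every count from m2 onto the running totals in m1.
def mergeCounts(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] =
  m2.foldLeft(m1) { case (acc, (term, n)) =>
    acc.updated(term, acc.getOrElse(term, 0) + n)
  }

// Same sample data as in the answer, kept local for illustration.
val docs = List(
  "i am a good boy.Are you a good boy.",
  "You are also working here.",
  "I am posting here today.You are good."
)
val dictionary = Set("good", "working", "posting")

// One map per document: term -> 1 for each dictionary term it contains.
val perDoc: List[Map[String, Int]] =
  docs.map(doc => dictionary.collect { case term if doc.contains(term) => term -> 1 }.toMap)

// Local stand-in for the RDD reduce.
val total: Map[String, Int] = perDoc.reduce(mergeCounts)
```

Since `mergeCounts` is associative and commutative, the result should not depend on the order in which partitions are combined, which is what an RDD `reduce` requires of its function.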