I am new to Scala, which is probably why I am running into these small questions.
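I have some tuples, for example ("The", "band"), ("The", "show"), ("done", "by"), ("The", "band"), ("done", "that"). Here the pair ("The", "band") occurs 2 times, and the number of pairs starting with the word "The" is 3.
So the relative frequency of the pair ("The", "band") is 2/3 = 0.66.
What I ultimately want looks like this: ((The, band), 0.66) ((The, show), 0.33) ((done, by), 0.5) ((done, that), 0.5).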
What I have done so far: my variable items1 contains all the pairs listed above, and
val result = items1.map(x => (x -> 1)).reduceByKey(_ + _)
gives me something like ((The, band), 2) ((The, show), 1) ((done, by), 1) ((done, that), 1).
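Now I also need the number of pairs that start with "The" or with "done", so that I can apply the division. I was able to get the count of pairs per first word in a separate variable, but then I could not work out how to divide by it.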
Answer 0 (score: 2)
This will work:
def calcFreqs(xs: List[(String, String)]): Seq[((String, String), Double)] = {
  val den = xs.groupBy(_._1).mapValues(_.length)  // Map(word1, counts)
  xs.groupBy(identity)
    .mapValues(_.length)                          // Map(pair, counts)
    .toSeq                                        // Seq(pair, counts)
    .map { case ((word1, word2), num) =>
      ((word1, word2), num.toDouble / den(word1)) // Seq(pair, pair/word1 ratio)
    }
}
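On Scala 2.13 and later, `mapValues` on a `Map` is deprecated and returns a lazy view, so the same idea can also be written with `groupMapReduce`. The following is only a sketch under that version assumption; the name `relativeFreqs` is hypothetical and not part of the answer above.

```scala
// Sketch for Scala 2.13+ (groupMapReduce does not exist in earlier versions).
// `relativeFreqs` is a hypothetical name used for illustration only.
def relativeFreqs(xs: List[(String, String)]): Map[(String, String), Double] = {
  // Number of pairs per first word, e.g. Map(The -> 3, done -> 2)
  val firstWordCounts: Map[String, Int] =
    xs.groupMapReduce(_._1)(_ => 1)(_ + _)
  // Number of occurrences of each pair, e.g. Map((The,band) -> 2, ...)
  val pairCounts: Map[(String, String), Int] =
    xs.groupMapReduce(p => p)(_ => 1)(_ + _)
  // Divide each pair count by the count of its first word
  pairCounts.map { case (pair @ (first, _), n) =>
    pair -> (n.toDouble / firstWordCounts(first))
  }
}

val freqs = relativeFreqs(List(
  ("The", "band"), ("The", "show"), ("done", "by"), ("The", "band"), ("done", "that")
))
```

Both counts are built in a single pass each, and the result is a `Map`, so the iteration order of the pairs is not guaranteed.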
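Answer 1 (score: 1)
Since you tried to use reduceByKey, I assume the dataset you are working with is a Spark RDD. Here is an approach that uses groupByKey and then groups the values under each key to compute each second word's share of the occurrences: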
val rdd = sc.parallelize(Seq(
  ("The", "band"), ("The", "show"), ("done", "by"), ("The", "band"), ("done", "that")
))
rdd.groupByKey.mapValues{ arr =>
    arr.groupBy(identity).mapValues(_.size.toDouble / arr.size).toSeq
  }.
  flatMap{ case (k, vs) => vs.map(v => ((k, v._1), v._2)) }.
  collect
// res1: Array[((String, String), Double)] = Array(
//   ((The,band),0.66), ((The,show),0.33), ((done,that),0.5), ((done,by),0.5)
// )
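An alternative that stays closer to the reduceByKey attempt in the question is to compute the pair counts and the first-word counts separately and then join them on the first word. This is a sketch, not a tuned implementation: the local SparkContext setup, the app name, and the variable names `pairCounts`, `firstCounts`, and `relFreqs` are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local context for illustration. In spark-shell, `sc` already
// exists and this line should be skipped.
val sc = new SparkContext(
  new SparkConf().setAppName("relative-freq").setMaster("local[*]"))

val items1 = sc.parallelize(Seq(
  ("The", "band"), ("The", "show"), ("done", "by"), ("The", "band"), ("done", "that")
))

// ((first, second), count), as already computed in the question
val pairCounts = items1.map(p => (p, 1)).reduceByKey(_ + _)
// (first, number of pairs that start with first)
val firstCounts = items1.map { case (first, _) => (first, 1) }.reduceByKey(_ + _)

// Re-key the pair counts by first word, join with the totals, then divide
val relFreqs = pairCounts
  .map { case ((first, second), n) => (first, (second, n)) }
  .join(firstCounts)
  .map { case (first, ((second, n), total)) => ((first, second), n.toDouble / total) }

val result = relFreqs.collect().toMap
sc.stop()
```

Because both counts come from reduceByKey, values are combined within each partition before the shuffle, which tends to matter once the data is much larger than this example.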
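If it is a plain Scala collection, neither reduceByKey nor groupByKey is available. A solution using groupBy is similar, but differs slightly because its method signature is not the same as that of the RDD's groupByKey: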
val list = List(
  ("The", "band"), ("The", "show"), ("done", "by"), ("The", "band"), ("done", "that")
)
list.groupBy(_._1).mapValues{ ls =>
    ls.groupBy(identity).mapValues(_.size.toDouble / ls.size)
  }.
  flatMap(_._2).toList
// res1: List[((String, String), Double)] = List(
//   ((done,by),0.5), ((done,that),0.5), ((The,band),0.66), ((The,show),0.33)
// )
Answer 2 (score: 1)
Given a list of tuples:
val items = List(("The","band"),("The","show"),("done","by"),("The","band"),("done","that"))
Use:
def rFreq(items: List[(String, String)]) = {
  val a1 = items.groupBy(identity).map(x => (x._1, x._2.size)) // pair -> count
  val a2 = items.groupBy(_._1).map(x => (x._1, x._2.size))     // first word -> count
  a1.map(x => (x._1, x._2 * 1.0 / a2.get(x._1._1).get))
}
In the Scala REPL:
scala> rFreq(items)
res99: scala.collection.immutable.Map[(String, String),Double] = Map((The,band) -> 0.6666666666666666, (The,show) -> 0.3333333333333333, (done,by) -> 0.5, (done,that) -> 0.5)
Answer 3 (score: 1)
You first need to compute the required counts into Maps so that you can look them up in constant time. That way you can get the final result in O(n) time.
val items = List(("The","band"),("The","show"),("done","by"),("The","band"),("done","that"))
// items: List[(String, String)] = List((The,band), (The,show), (done,by), (The,band), (done,that))
val firstWordCountMap = items.foldLeft(Map.empty[String, Int])({ case (accMap, (first, second)) =>
  accMap + (first -> (accMap.getOrElse(first, 0) + 1))
})
// firstWordCountMap: scala.collection.immutable.Map[String,Int] = Map(The -> 3, done -> 2)
val itemsCountMap = items.foldLeft(Map.empty[(String, String), Int])({ case (accMap, item) =>
  accMap + (item -> (accMap.getOrElse(item, 0) + 1))
})
// itemsCountMap: scala.collection.immutable.Map[(String, String),Int] = Map((The,band) -> 2, (The,show) -> 1, (done,by) -> 1, (done,that) -> 1)
val itemsRatioList = itemsCountMap.map({ case ((first, second), count) =>
  ((first, second), count.toDouble / firstWordCountMap(first))
}).toList
// itemsRatioList: List[((String, String), Double)] = List(((The,band),0.6666666666666666), ((The,show),0.3333333333333333), ((done,by),0.5), ((done,that),0.5))
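The ratios above are printed with full Double precision, whereas the question shows values such as 0.66. If two-decimal values are wanted, one option is to truncate rather than round (rounding 2/3 would give 0.67). This is a small sketch; `truncate2` is a hypothetical helper name, not something from the answers.

```scala
import scala.math.BigDecimal.RoundingMode

// Hypothetical helper: truncates (does not round) a ratio to two decimal places.
def truncate2(x: Double): Double =
  BigDecimal(x).setScale(2, RoundingMode.DOWN).toDouble

// Applied to the three distinct ratios from the example
val shown = List(2.0 / 3, 1.0 / 3, 0.5).map(truncate2)
// shown: List[Double] = List(0.66, 0.33, 0.5)
```

It could be applied as a final step, for example by mapping it over the Double in each (pair, ratio) element of the result.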