Relative frequency of word pairs in Scala

Time: 2018-09-19 02:48:46

Tags: scala

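I am new to Scala, which is probably why I have this small doubt. I have some tuples, for example ("The", "band"), ("The", "show"), ("done", "by"), ("The", "band"), ("done", "that"). Here the pair ("The", "band") appears 2 times, and the number of pairs starting with the word "The" is 3.

So the relative frequency of the pair ("The", "band") is

  • 2/3 = 0.66

So what I ultimately want looks like this: ((The, band), 0.66) ((The, show), 0.33) ((done, by), 0.5) ((done, that), 0.5)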
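
Stated generally, the quantity asked for here appears to be a conditional relative frequency. The notation below is an assumption introduced only for illustration: c(w1, w2) is the number of times the pair (w1, w2) occurs, and the denominator sums over every second word w that follows w1.

```latex
f(w_2 \mid w_1) = \frac{c(w_1, w_2)}{\sum_{w} c(w_1, w)},
\qquad
f(\text{band} \mid \text{The}) = \frac{2}{2 + 1} = \frac{2}{3} \approx 0.667
```

The 0.66 and 0.33 shown above are these values truncated to two decimal places.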


This is what I have done so far. My variable items1 contains all the pairs mentioned above:

val result = items1.map(x=>(x->1)).reduceByKey(_+_)

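This gives me something like: ((The, band), 2) ((The, show), 1) ((done, by), 1) ((done, that), 1)

Now I also want the number of pairs that start with the word "The" or "done", so that I can apply the division. I was able to get the count of pairs per first word in a separate variable, but then I could not work out how to divide by it.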

4 answers:

Answer 0: (score: 2)

This will work:

def calcFreqs(xs: List[(String, String)]): Seq[((String, String), Double)] = {
  val den = xs.groupBy(_._1).mapValues(_.length)   // Map(word1, counts)
  xs.groupBy(identity)                           
    .mapValues(_.length)                           // Map(pair, counts)
    .toSeq                                         // Seq(pair, counts)
    .map{ case ((word1, word2), num) => 
      ((word1, word2), num.toDouble / den(word1))} // Seq(pair, pair/word1 ratio) 
}
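
As a side note, on Scala 2.13 or later the same two counts can be built with groupMapReduce, which avoids materialising the intermediate groups and sidesteps mapValues (deprecated on Map from 2.13). The following is only a sketch under that version assumption; the name calcFreqs213 is made up for illustration and is not part of the answer above.

```scala
// Assumes Scala 2.13+ (groupMapReduce was added to the collections in 2.13).
// calcFreqs213 is a hypothetical name used only for this sketch.
def calcFreqs213(xs: List[(String, String)]): Map[(String, String), Double] = {
  // first word -> number of pairs starting with it
  val firstCounts = xs.groupMapReduce(_._1)(_ => 1)(_ + _)
  // pair -> number of occurrences of that exact pair
  val pairCounts = xs.groupMapReduce(identity)(_ => 1)(_ + _)
  // divide each pair count by the count of its first word
  pairCounts.map { case (pair @ (word1, _), num) =>
    pair -> num.toDouble / firstCounts(word1)
  }
}
```

Unlike calcFreqs, this sketch returns a Map rather than a Seq; call .toSeq on the result if a sequence of tuples is needed.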

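Answer 1: (score: 1)

Given that you are trying to use reduceByKey, I assume the dataset you are working with is a Spark RDD. Here is one approach that uses groupByKey and then groups the resulting values per key to compute each pair's share of occurrences: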

val rdd = sc.parallelize(Seq(
  ("The", "band"), ("The", "show"), ("done", "by"), ("The", "band"), ("done", "that")
))

rdd.groupByKey.mapValues{ arr =>
    arr.groupBy(identity).mapValues(_.size.toDouble / arr.size).toSeq
  }.
  flatMap{ case (k, vs) => vs.map(v => ((k, v._1), v._2)) }.
  collect
// res1: Array[((String, String), Double)] = Array(
//  ((The,band),0.6666666666666666), ((The,show),0.3333333333333333), ((done,that),0.5), ((done,by),0.5)
// )
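
If you would rather keep the reduceByKey counts from the question, one possible way to do the division is to re-key the pair counts by their first word and join them with the per-word totals. This is a sketch rather than a tuned implementation; relativeFreqs is a hypothetical helper name, and it assumes an RDD of the same shape as rdd above.

```scala
import org.apache.spark.rdd.RDD

// relativeFreqs is a hypothetical name used only for this sketch.
def relativeFreqs(pairs: RDD[(String, String)]): RDD[((String, String), Double)] = {
  // ((word1, word2), n): occurrences of each pair, as in the question
  val pairCounts = pairs.map(p => (p, 1)).reduceByKey(_ + _)
  // (word1, total): number of pairs starting with each first word
  val firstCounts = pairs.map { case (word1, _) => (word1, 1) }.reduceByKey(_ + _)
  // re-key the pair counts by first word so they can be joined with the totals
  pairCounts
    .map { case ((word1, word2), n) => (word1, (word2, n)) }
    .join(firstCounts)
    .map { case (word1, ((word2, n), total)) => ((word1, word2), n.toDouble / total) }
}
```

Calling relativeFreqs(rdd).collect should yield the same four ratios as the groupByKey version, possibly in a different order.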

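If it is a plain Scala collection, neither reduceByKey nor groupByKey is available. A solution using groupBy is similar, but differs slightly because its method signature is not the same as that of the RDD's groupByKey: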

val list = List(
  ("The", "band"), ("The", "show"), ("done", "by"), ("The", "band"), ("done", "that")
)

list.groupBy(_._1).mapValues{ ls =>
    ls.groupBy(identity).mapValues(_.size.toDouble / ls.size)
  }.
  flatMap(_._2).toList
// res1: List[((String, String), Double)] = List(
//   ((done,by),0.5), ((done,that),0.5), ((The,band),0.6666666666666666), ((The,show),0.3333333333333333)
// )

Answer 2: (score: 1)

Given a list of tuples:

val items = List(("The","band"),("The","show"),("done","by"),("The","band"),("done","that"))

Use:

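def rFreq(items: List[(String, String)]): Map[(String, String), Double] = {
  // pair -> number of occurrences of that pair
  val pairCounts = items.groupBy(identity).map { case (pair, occ) => (pair, occ.size) }
  // first word -> number of pairs starting with it
  val firstCounts = items.groupBy(_._1).map { case (word1, occ) => (word1, occ.size) }
  pairCounts.map { case (pair, n) => (pair, n.toDouble / firstCounts(pair._1)) }
}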

In the Scala REPL:

scala> rFreq(items)
res99: scala.collection.immutable.Map[(String, String),Double] = Map((The,band) -> 0.6666666666666666, (The,show) -> 0.3333333333333333, (done,by) -> 0.5, (done,that) -> 0.5)
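
The question shows the ratios as 0.66 and 0.33, while the REPL prints full Double precision. If the two-decimal form is wanted, one option is to truncate through BigDecimal. This is a sketch; truncate2 is a hypothetical helper name, and truncation (rather than rounding) is assumed only because it reproduces the 0.66 from the question.

```scala
import scala.math.BigDecimal.RoundingMode

// truncate2 is a hypothetical helper: cut a ratio to two decimals without rounding up.
def truncate2(x: Double): Double =
  BigDecimal(x).setScale(2, RoundingMode.DOWN).toDouble

// Sample ratios matching the example data in this thread
val freqs = Map(("The", "band") -> 2.0 / 3, ("The", "show") -> 1.0 / 3, ("done", "by") -> 0.5)

// (The,band) -> 0.66, (The,show) -> 0.33, (done,by) -> 0.5
val shown = freqs.map { case (pair, f) => pair -> truncate2(f) }
```

Using RoundingMode.HALF_UP instead would give 0.67 for ("The", "band"), which is the conventionally rounded value.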

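Answer 3: (score: 1)

You first need to compute the required counts into Maps so that you can look them up in (effectively) constant time. That way you can get the final result in O(n) time.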

val items = List(("The","band"),("The","show"),("done","by"),("The","band"),("done","that"))
// items: List[(String, String)] = List((The,band), (The,show), (done,by), (The,band), (done,that))

val firstWordCountMap = items.foldLeft(Map.empty[String, Int])({case (accMap, (first, second)) =>
  accMap + (first -> (accMap.getOrElse(first, 0) + 1))
})
// firstWordCountMap: scala.collection.immutable.Map[String,Int] = Map(The -> 3, done -> 2)

val itemsCountMap = items.foldLeft(Map.empty[(String, String), Int])({case (accMap, item) =>
  accMap + (item -> (accMap.getOrElse(item, 0) + 1))
})
// itemsCountMap: scala.collection.immutable.Map[(String, String),Int] = Map((The,band) -> 2, (The,show) -> 1, (done,by) -> 1, (done,that) -> 1)

val itemsRatioList = itemsCountMap.map({ case ((first, second), count) =>
  ((first, second), count.toDouble / firstWordCountMap(first))
}).toList
// itemsRatioList: List[((String, String), Double)] = List(((The,band),0.6666666666666666), ((The,show),0.3333333333333333), ((done,by),0.5), ((done,that),0.5))