I'm trying to write a program in Scala that counts the occurrences of words in a file, in the simplest way possible. So far I have this code:
import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File

object WordCounter {
  val SrcDestination: String = ".." + File.separator + "file.txt"
  val Word = "\\b([A-Za-z\\-])+\\b".r

  def main(args: Array[String]): Unit = {
    val counter = Source.fromFile(SrcDestination)("UTF-8")
      .getLines
      .map(l => Word.findAllIn(l.toLowerCase()).toSeq)
      .toStream
      .groupBy(identity)
      .mapValues(_.length)
    println(counter)
  }
}
Don't mind the regex. I'd like to know how to extract the single words from the sequences retrieved on this line:

map(l => Word.findAllIn(l.toLowerCase()).toSeq)

so that I can count the occurrence of each word. Right now I'm getting a map that counts sequences of words instead.
Answer 0 (score: 33)
You can split the file's lines into words by using the regex "\\W+" (flatMap is lazy, so it doesn't need to load the entire file into memory). To count occurrences, you can fold over a Map[String, Int], updating it with each word; this is much more memory- and time-efficient than using groupBy.
scala.io.Source.fromFile("file.txt")
  .getLines
  .flatMap(_.split("\\W+"))
  .foldLeft(Map.empty[String, Int]) {
    (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
  }
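If you also want to display the most frequent words, the resulting Map[String, Int] can be sorted by count afterwards. A minimal sketch repeating the fold above (the top-10 cutoff and the printing format are arbitrary additions, not part of the original answer):

import scala.io.Source

val counts: Map[String, Int] = Source.fromFile("file.txt")
  .getLines
  .flatMap(_.split("\\W+"))
  .foldLeft(Map.empty[String, Int]) {
    (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
  }

// Sort descending by count and show the 10 most frequent words.
counts.toSeq.sortBy(-_._2).take(10).foreach {
  case (word, n) => println(s"$word: $n")
}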
Answer 1 (score: 14)
I think the following is slightly easier to understand:
Source.fromFile("file.txt").
getLines().
flatMap(_.split("\\W+")).
toList.
groupBy((word: String) => word).
mapValues(_.length)
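One caveat worth noting: on Scala 2.13 and later, mapValues on a strict Map is deprecated in favour of a view. Assuming that version, the same pipeline can be written as the following sketch:

import scala.io.Source

Source.fromFile("file.txt")
  .getLines()
  .flatMap(_.split("\\W+"))
  .toList
  .groupBy(identity)
  .view.mapValues(_.length)
  .toMap // force the lazy MapView back into a strict Map[String, Int]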
Answer 2 (score: 1)
I'm not 100% sure what you're asking, but I think I see the problem. Try using flatMap instead of map:
flatMap(l => Word.findAllIn(l.toLowerCase()).toSeq)
This will concatenate all of the sequences together, so that groupBy is done on individual words rather than at the line level.
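Applied to the question's program, that's a one-word change. A minimal sketch of the full file with flatMap swapped in (keeping the original path and regex, but using Source.fromFile's (name, encoding) overload in place of the implicit-codec import):

import java.io.File
import scala.io.Source

object WordCounter {
  val SrcDestination: String = ".." + File.separator + "file.txt"
  val Word = "\\b([A-Za-z\\-])+\\b".r

  def main(args: Array[String]): Unit = {
    val counter = Source.fromFile(SrcDestination, "UTF-8")
      .getLines
      .flatMap(l => Word.findAllIn(l.toLowerCase()).toSeq) // flatMap flattens the per-line matches
      .toStream
      .groupBy(identity)
      .mapValues(_.length)
    println(counter)
  }
}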
A note about your regex

I know you said not to worry about the regex, but there are a couple of changes you can make to it that will make it more readable. Here's what you have now:
val Word = "\\b([A-Za-z\\-])+\\b".r
First, you can use Scala's triple-quoted strings so that you don't have to escape the backslashes:
val Word = """\b([A-Za-z\-])+\b""".r
Second, if you put the - at the beginning of the character class, then you don't need to escape it:
val Word = """\b([-A-Za-z])+\b""".r
Answer 3 (score: 1)
Here's what I did. This chops up a file. A hashmap is a good bet for high performance, and it will outperform any sort. There are more terse sort and slice functions in there too that you can look at.
import java.io.FileNotFoundException

/**
 * Cohesive static method object for file handling.
 */
object WordCountFileHandler {

  val FILE_FORMAT = "utf-8"

  /**
   * Take input from file. Split on spaces.
   * @param fileLocationAndName string location of file
   * @return option of string iterator
   */
  def apply(fileLocationAndName: String): Option[Iterator[String]] = {
    apply(fileLocationAndName, " ")
  }

  /**
   * Split on separator parameter.
   * Speculative generality :P
   * @param fileLocationAndName string location of file
   * @param wordSeperator split on this string
   * @return
   */
  def apply(fileLocationAndName: String, wordSeperator: String): Option[Iterator[String]] = {
    try {
      val words = scala.io.Source.fromFile(fileLocationAndName).getLines() // scala.io.Source is a bit hackey. No need to close file.
      // Get rid of anything funky... need the double-space collapse for files like the README.md...
      val wordList = words.reduceLeft(_ + wordSeperator + _).replaceAll("[^a-zA-Z\\s]", "").replaceAll("  ", " ").split(wordSeperator)
      //wordList.foreach(println(_))
      wordList.length match {
        case 0 => None
        case _ => Some(wordList.toIterator)
      }
    } catch {
      case _: FileNotFoundException => println("file not found: " + fileLocationAndName); None
      case e: Exception => println("Unknown exception occurred during file handling: \n\n" + e.getStackTrace); None
    }
  }
}
import collection.mutable

/**
 * Static method object.
 * Takes a processed map and spits out the needed info.
 * While a small performance hit is made in not doing this during the word list analysis,
 * this does demonstrate cohesion and open/closed much better.
 * author: jason goodwin
 */
object WordMapAnalyzer {

  /**
   * Get input size.
   * @param input
   * @return
   */
  def getNumberOfWords(input: mutable.Map[String, Int]): Int = {
    input.size
  }

  /**
   * Should be fairly logarithmic, given merge sort performance is generally about O(6n log2 n + 6n).
   * See below for a more performant method.
   * @param input
   * @return
   */
  def getTopCWordsDeclarative(input: mutable.HashMap[String, Int], c: Int): Map[String, Int] = {
    val sortedInput = input.toList.sortWith(_._2 > _._2)
    sortedInput.take(c).toMap
  }

  /**
   * Imperative style is used here for much better performance relative to the above.
   * Growth can be reasoned as linear on random input.
   * Probably upper bounded around O(3n + nc) in the worst case (i.e. an input sorted from small to high).
   * @param input
   * @param c
   * @return
   */
  def getTopCWordsImperative(input: mutable.Map[String, Int], c: Int): mutable.Map[String, Int] = {
    var bottomElement: (String, Int) = ("", 0)
    val topList = mutable.HashMap[String, Int]()

    for (x <- input) {
      if (x._2 >= bottomElement._2 && topList.size == c) {
        topList -= bottomElement._1
        topList += ((x._1, x._2))
        bottomElement = topList.toList.minBy(_._2)
      } else if (topList.size < c) {
        topList += ((x._1, x._2))
        bottomElement = topList.toList.minBy(_._2)
      }
    }
    //println("Size: " + topList.size)
    topList.asInstanceOf[mutable.Map[String, Int]]
  }
}
object WordMapCountCalculator {

  /**
   * Take a list and return a map keyed by words with a count as the value.
   * @param wordList List[String] to be analysed
   * @return HashMap[String, Int] with word as key and count as value.
   */
  def apply(wordList: Iterator[String]): mutable.Map[String, Int] = {
    wordList.foldLeft(new mutable.HashMap[String, Int])((acc, word) => {
      acc.get(word) match {
        case Some(x) => acc += (word -> (x + 1)) // if in map already, increment count
        case None    => acc += (word -> 1)       // otherwise, set to 1
      }
    }).asInstanceOf[mutable.Map[String, Int]]
  }
}
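The answer doesn't show the three objects being wired together. A hypothetical driver (the file name and top-10 cutoff are invented here) could look like this:

object WordCountApp extends App {
  // Read and tokenize the file, then count the words and report the top 10.
  WordCountFileHandler("file.txt") match {
    case Some(words) =>
      val counts = WordMapCountCalculator(words)
      println("distinct words: " + WordMapAnalyzer.getNumberOfWords(counts))
      WordMapAnalyzer.getTopCWordsImperative(counts, 10).foreach(println)
    case None =>
      println("no words found")
  }
}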
Answer 4 (score: 0)
Starting in Scala 2.13, besides retrieving the words with Source, we can use the groupMapReduce method, which (as its name suggests) is the equivalent of a groupBy followed by mapValues and a reduce step:
import scala.io.Source

Source.fromFile("file.txt")
  .getLines.to(LazyList)
  .flatMap(_.split("\\W+"))
  .groupMapReduce(identity)(_ => 1)(_ + _)
The groupMapReduce stage, similar to Hadoop's map/reduce logic,

groups words by themselves (identity) (the group part of groupMapReduce)

maps each grouped word occurrence to 1 (the map part of groupMapReduce)

reduces the values within a group of words (_ + _) by summing them (the reduce part of groupMapReduce).
This is a one-pass version of what can be translated by:
seq.groupBy(identity).mapValues(_.map(_ => 1).reduce(_ + _))
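For example, on a tiny invented input, both expressions produce the same counts:

val seq = Seq("the", "cat", "the")

seq.groupMapReduce(identity)(_ => 1)(_ + _)
// => Map(the -> 2, cat -> 1)

seq.groupBy(identity).mapValues(_.map(_ => 1).reduce(_ + _)).toMap
// => Map(the -> 2, cat -> 1)  (element order may vary)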
Note also the conversion from Iterator to LazyList, in order to use a collection that provides groupMapReduce (we don't use a Stream, since starting with Scala 2.13, LazyList is the recommended replacement for Streams).
And based on the same principle, a for-comprehension version is also possible:
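The snippet for that variant was cut off in this copy; a sketch of what it plausibly looks like, following the same groupMapReduce pipeline:

import scala.io.Source

(for {
  line <- Source.fromFile("file.txt").getLines().to(LazyList)
  word <- line.split("\\W+")
} yield word)
  .groupMapReduce(identity)(_ => 1)(_ + _)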