I'm trying to write a program in Scala that counts the occurrences of words in a file, in the simplest way possible. So far I have this code:
import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File

object WordCounter {
  val SrcDestination: String = ".." + File.separator + "file.txt"
  val Word = "\\b([A-Za-z\\-])+\\b".r

  def main(args: Array[String]): Unit = {
    val counter = Source.fromFile(SrcDestination)("UTF-8")
      .getLines
      .map(l => Word.findAllIn(l.toLowerCase()).toSeq)
      .toStream
      .groupBy(identity)
      .mapValues(_.length)
    println(counter)
  }
}
Don't mind the regex. I'd like to know how to extract the single words from the sequences retrieved on this line:

map(l => Word.findAllIn(l.toLowerCase()).toSeq)

so that I can count the occurrence of each word. Right now I'm getting a map that counts sequences of words instead.
Answer 0 (score: 33)
You can split the file's lines into words by using the regex "\\W+" (flatMap is lazy, so it doesn't need to load the entire file into memory). To count occurrences, you can fold over a Map[String, Int], updating it with each word; this is much more memory- and time-efficient than using groupBy.
scala.io.Source.fromFile("file.txt")
  .getLines
  .flatMap(_.split("\\W+"))
  .foldLeft(Map.empty[String, Int]) {
    (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
  }
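If you also want to display the most frequent words, the resulting Map[String, Int] can be sorted by count afterwards. A minimal sketch repeating the fold above (the top-10 cutoff and the printing format are arbitrary additions, not part of the original answer):

import scala.io.Source

val counts: Map[String, Int] = Source.fromFile("file.txt")
  .getLines
  .flatMap(_.split("\\W+"))
  .foldLeft(Map.empty[String, Int]) {
    (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
  }

// Sort descending by count and show the 10 most frequent words.
counts.toSeq.sortBy(-_._2).take(10).foreach {
  case (word, n) => println(s"$word: $n")
}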
Answer 1 (score: 14)
I think the following is slightly easier to understand:
Source.fromFile("file.txt").
getLines().
flatMap(_.split("\\W+")).
toList.
groupBy((word: String) => word).
mapValues(_.length)
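One caveat worth noting: on Scala 2.13 and later, mapValues on a strict Map is deprecated in favour of a view. Assuming that version, the same pipeline can be written as the following sketch:

import scala.io.Source

Source.fromFile("file.txt")
  .getLines()
  .flatMap(_.split("\\W+"))
  .toList
  .groupBy(identity)
  .view.mapValues(_.length)
  .toMap // force the lazy MapView back into a strict Map[String, Int]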
Answer 2 (score: 1)
I'm not 100% sure what you're asking, but I think I see the problem. Try using flatMap instead of map:
flatMap(l => Word.findAllIn(l.toLowerCase()).toSeq)
This will concatenate all of the sequences together, so that groupBy is done on individual words rather than at the line level.
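Applied to the question's program, that's a one-word change. A minimal sketch of the full file with flatMap swapped in (keeping the original path and regex, but using Source.fromFile's (name, encoding) overload in place of the implicit-codec import):

import java.io.File
import scala.io.Source

object WordCounter {
  val SrcDestination: String = ".." + File.separator + "file.txt"
  val Word = "\\b([A-Za-z\\-])+\\b".r

  def main(args: Array[String]): Unit = {
    val counter = Source.fromFile(SrcDestination, "UTF-8")
      .getLines
      .flatMap(l => Word.findAllIn(l.toLowerCase()).toSeq) // flatMap flattens the per-line matches
      .toStream
      .groupBy(identity)
      .mapValues(_.length)
    println(counter)
  }
}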
A note about your regex

I know you said not to worry about the regex, but there are a couple of changes you can make to it that will make it more readable. Here's what you have now:
val Word = "\\b([A-Za-z\\-])+\\b".r
First, you can use Scala's triple-quoted strings so that you don't have to escape the backslashes:
val Word = """\b([A-Za-z\-])+\b""".r
Second, if you put the - at the beginning of the character class, then you don't need to escape it:
val Word = """\b([-A-Za-z])+\b""".r
Answer 3 (score: 1)
Here's what I did. This chops up a file. A hashmap is a good bet for high performance, and it will outperform any sort. There are more terse sort and slice functions in there too that you can look at.
import java.io.FileNotFoundException

/**
 * Cohesive static method object for file handling.
 */
object WordCountFileHandler {

  val FILE_FORMAT = "utf-8"

  /**
   * Take input from file. Split on spaces.
   * @param fileLocationAndName string location of file
   * @return option of string iterator
   */
  def apply(fileLocationAndName: String): Option[Iterator[String]] = {
    apply(fileLocationAndName, " ")
  }

  /**
   * Split on separator parameter.
   * Speculative generality :P
   * @param fileLocationAndName string location of file
   * @param wordSeperator split on this string
   * @return
   */
  def apply(fileLocationAndName: String, wordSeperator: String): Option[Iterator[String]] = {
    try {
      val words = scala.io.Source.fromFile(fileLocationAndName).getLines() // scala.io.Source is a bit hackey. No need to close file.
      // Get rid of anything funky... need the double-space collapse for files like the README.md...
      val wordList = words.reduceLeft(_ + wordSeperator + _).replaceAll("[^a-zA-Z\\s]", "").replaceAll("  ", " ").split(wordSeperator)
      //wordList.foreach(println(_))
      wordList.length match {
        case 0 => None
        case _ => Some(wordList.toIterator)
      }
    } catch {
      case _: FileNotFoundException => println("file not found: " + fileLocationAndName); None
      case e: Exception => println("Unknown exception occurred during file handling: \n\n" + e.getStackTrace); None
    }
  }
}
import collection.mutable

/**
 * Static method object.
 * Takes a processed map and spits out the needed info.
 * While a small performance hit is made in not doing this during the word list analysis,
 * this does demonstrate cohesion and open/closed much better.
 * author: jason goodwin
 */
object WordMapAnalyzer {

  /**
   * Get input size.
   * @param input
   * @return
   */
  def getNumberOfWords(input: mutable.Map[String, Int]): Int = {
    input.size
  }

  /**
   * Should be fairly logarithmic, given merge sort performance is generally about O(6n log2 n + 6n).
   * See below for a more performant method.
   * @param input
   * @return
   */
  def getTopCWordsDeclarative(input: mutable.HashMap[String, Int], c: Int): Map[String, Int] = {
    val sortedInput = input.toList.sortWith(_._2 > _._2)
    sortedInput.take(c).toMap
  }

  /**
   * Imperative style is used here for much better performance relative to the above.
   * Growth can be reasoned as linear on random input.
   * Probably upper bounded around O(3n + nc) in the worst case (i.e. an input sorted from small to high).
   * @param input
   * @param c
   * @return
   */
  def getTopCWordsImperative(input: mutable.Map[String, Int], c: Int): mutable.Map[String, Int] = {
    var bottomElement: (String, Int) = ("", 0)
    val topList = mutable.HashMap[String, Int]()

    for (x <- input) {
      if (x._2 >= bottomElement._2 && topList.size == c) {
        topList -= bottomElement._1
        topList += ((x._1, x._2))
        bottomElement = topList.toList.minBy(_._2)
      } else if (topList.size < c) {
        topList += ((x._1, x._2))
        bottomElement = topList.toList.minBy(_._2)
      }
    }
    //println("Size: " + topList.size)
    topList.asInstanceOf[mutable.Map[String, Int]]
  }
}
object WordMapCountCalculator {

  /**
   * Take a list and return a map keyed by words with a count as the value.
   * @param wordList List[String] to be analysed
   * @return HashMap[String, Int] with word as key and count as value.
   */
  def apply(wordList: Iterator[String]): mutable.Map[String, Int] = {
    wordList.foldLeft(new mutable.HashMap[String, Int])((acc, word) => {
      acc.get(word) match {
        case Some(x) => acc += (word -> (x + 1)) // if in map already, increment count
        case None    => acc += (word -> 1)       // otherwise, set to 1
      }
    }).asInstanceOf[mutable.Map[String, Int]]
  }
}
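The answer doesn't show the three objects being wired together. A hypothetical driver (the file name and top-10 cutoff are invented here) could look like this:

object WordCountApp extends App {
  // Read and tokenize the file, then count the words and report the top 10.
  WordCountFileHandler("file.txt") match {
    case Some(words) =>
      val counts = WordMapCountCalculator(words)
      println("distinct words: " + WordMapAnalyzer.getNumberOfWords(counts))
      WordMapAnalyzer.getTopCWordsImperative(counts, 10).foreach(println)
    case None =>
      println("no words found")
  }
}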
Answer 4 (score: 0)
Starting in Scala 2.13, besides retrieving the words with Source, we can use the groupMapReduce method, which (as its name suggests) is the equivalent of a groupBy followed by mapValues and a reduce step:
import scala.io.Source

Source.fromFile("file.txt")
  .getLines.to(LazyList)
  .flatMap(_.split("\\W+"))
  .groupMapReduce(identity)(_ => 1)(_ + _)
The groupMapReduce stage, similar to Hadoop's map/reduce logic,

groups words by themselves (identity) (the group part of groupMapReduce)

maps each grouped word occurrence to 1 (the map part of groupMapReduce)

reduces the values within a group of words (_ + _) by summing them (the reduce part of groupMapReduce).
This is a one-pass version of what can be translated by:
seq.groupBy(identity).mapValues(_.map(_ => 1).reduce(_ + _))
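For example, on a tiny invented input, both expressions produce the same counts:

val seq = Seq("the", "cat", "the")

seq.groupMapReduce(identity)(_ => 1)(_ + _)
// => Map(the -> 2, cat -> 1)

seq.groupBy(identity).mapValues(_.map(_ => 1).reduce(_ + _)).toMap
// => Map(the -> 2, cat -> 1)  (element order may vary)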
Note also the conversion from Iterator to LazyList, in order to use a collection that provides groupMapReduce (we don't use a Stream, since starting with Scala 2.13, LazyList is the recommended replacement for Streams).
And based on the same principle, a for-comprehension version is also possible:
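The snippet for that variant was cut off in this copy; a sketch of what it plausibly looks like, following the same groupMapReduce pipeline:

import scala.io.Source

(for {
  line <- Source.fromFile("file.txt").getLines().to(LazyList)
  word <- line.split("\\W+")
} yield word)
  .groupMapReduce(identity)(_ => 1)(_ + _)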