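I want to assign a numeric identifier to every unique word that appears in a text.
For this purpose I wrote the following function, which stores the words in a mutable.Map:
val dict = scala.collection.mutable.Map[String, Int]()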
var i = 0
def addToDict(line: String): Unit = {
  val words = line.split(' ') // returns Array[String]
  for (w <- words) {
    if (!dict.contains(w)) {
      dict.put(w, i)
      i = i + 1
    }
  }
}
longtext.collect().foreach(addToDict) // collect() returns the text line by line, where each line contains a few words
Is a mutable.Map the best collection for this purpose, or is there a better one?
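Answer 0 (score: 4)
Another approach, which relies on zipping each word with its index:
def addToDict(line: String) =
  line.split("\\W+").distinct.zipWithIndex.toMap
Note that splitting on the regex \\W+ (one or more non-word characters) breaks the line into words.
Hence,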
addToDict("the text line by line")
res: Map(the -> 0, text -> 1, line -> 2, by -> 3)
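One caveat with `\\W+`: when a line begins with punctuation or whitespace, `split` returns a leading empty string, which would then receive its own identifier. Below is a minimal sketch, assuming that empty tokens should simply be dropped. The object name `Tokenize` and the helper `words` are hypothetical names used only for illustration.

```scala
object Tokenize {
  // Split on runs of non-word characters and drop empty tokens,
  // which appear when the line starts with a separator.
  def words(line: String): Array[String] =
    line.split("\\W+").filter(_.nonEmpty)

  // Same idea as addToDict above, but built on the filtered tokens.
  def addToDict(line: String): Map[String, Int] =
    words(line).distinct.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    println(", hello world".split("\\W+").toList) // List(, hello, world)
    println(addToDict(", hello world"))           // Map(hello -> 0, world -> 1)
  }
}
```

Whether empty tokens should be filtered or kept depends on the input text; filtering is only one reasonable choice.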
Update
For a given text file, consider this:
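implicit class RichFile(val filename: String) extends AnyVal {
  def toDict(): Map[String, Int] = {
    val source = scala.io.Source.fromFile(filename)
    // Read all words eagerly (toList) so the file can be closed afterwards
    try source.getLines().flatMap(_.split("\\W+")).toList.distinct.zipWithIndex.toMap
    finally source.close()
  }
}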
Use it like this:
"longTextFilename".toDict()
Answer 1 (score: 2)
A fold would certainly be more idiomatic, and you can use distinct so that each word is considered only once:
def addToDict(line: String) =
  line.split(' ').distinct.foldLeft((0, Map[String, Int]())) {
    case ((i, m), s) => (i + 1, m + (s -> i))
  }._2
For example:
addToDict("a few words and another few words")
// Map(a -> 0, few -> 1, words -> 2, and -> 3, another -> 4)
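The functions above number the words of a single line, so the ids restart at 0 for every line. The following is a sketch of how the same fold might carry one dictionary across several lines, using the current map size as the next id. The names `FoldDict`, `addLine` and `buildDict` are illustrative assumptions, not part of the original answer.

```scala
object FoldDict {
  // Fold the words of one line into an existing dictionary.
  // A new word gets the current size of the map as its id, so ids stay dense.
  def addLine(dict: Map[String, Int], line: String): Map[String, Int] =
    line.split(' ').filter(_.nonEmpty).foldLeft(dict) { (m, w) =>
      if (m.contains(w)) m else m + (w -> m.size)
    }

  // Thread the dictionary through all lines, starting from an empty map.
  def buildDict(lines: Seq[String]): Map[String, Int] =
    lines.foldLeft(Map.empty[String, Int])(addLine)

  def main(args: Array[String]): Unit = {
    val dict = buildDict(Seq("a few words", "and another few words"))
    println(dict("another")) // 4
  }
}
```

Because the map itself records how many ids have been handed out, no separate counter is needed in the fold state.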
Answer 2 (score: 1)
A mutable structure is not necessary for something like this. I would prefer something along these lines:
def addToDict(line: String): Map[String, Int] =
  line.split(' ') // 1. split the line into words
    .foldLeft(0 -> Map.empty[String, Int]) { (st, w) => // 2. fill the dict
      val (i, m): (Int, Map[String, Int]) = st // current state: last id used, dict so far
      // determine the next state...
      if (!m.contains(w)) {
        val j = i + 1 // new numeric id (ids start at 1 here)
        j -> (m + (w -> j)) // updated state
      } else i -> m // unchanged state
    }._2 // 3. keep only the dict
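The fold accumulates a map from word to id. If a lookup in the other direction (id to word) is also wanted, one possible sketch is to swap the pairs of the finished map. The names `InverseDict` and `invert` are hypothetical, and the sample dictionary is made up for illustration.

```scala
object InverseDict {
  // Swap each (word, id) pair; this is safe because the ids are unique.
  def invert(dict: Map[String, Int]): Map[Int, String] =
    dict.map(_.swap)

  def main(args: Array[String]): Unit = {
    val dict = Map("a" -> 1, "few" -> 2, "words" -> 3) // example input, assumed
    println(invert(dict)(2)) // few
  }
}
```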