Scala: tokenization and dictionary

Date: 2014-08-15 00:15:36

Tags: scala dictionary tokenize

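I want to get a numeric identifier for every unique word that occurs in a text.

For this purpose I have written this function, which stores the words in a mutable.Map: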

val dict = scala.collection.mutable.Map[String, Int]()
var i = 0

def addToDict(line: String): Unit = {
    val words = line.split(' ') // returns Array[String]
    for (w <- words) {
        if (!dict.contains(w)) {
            dict.put(w, i)
            i = i + 1
        }
    }
}

longtext.collect().foreach(addToDict) //returns the text line by line, where each line contains a few words

Is a mutable.Map the best collection for this purpose, or is there a better one?

3 answers:

Answer 0 (score: 4)

Another approach, which relies on zipping the words with their indices:

def addToDict(line: String) = 
  line.split("\\W+").distinct.zipWithIndex.toMap

Note that the regex \\W+ splits a line into words on runs of non-word characters.

Hence,

addToDict("the text line by line")
res: Map(the -> 0, text -> 1, line -> 2, by -> 3)
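
To make the difference from the question's split(' ') concrete, here is a small sketch; the sample strings are made up for illustration. It also shows one caveat: a line that starts with a non-word character produces an empty first token, which you may want to filter out before assigning ids.

```scala
// Sample input, chosen only for illustration
val line = "Hello, world: hello again"

// Splitting on a single space keeps punctuation attached to the words
val bySpace = line.split(' ').toList
// List("Hello,", "world:", "hello", "again")

// Splitting on runs of non-word characters drops the punctuation
val byNonWord = line.split("\\W+").toList
// List("Hello", "world", "hello", "again")

// A leading non-word character yields an empty first token
val leading = " leading space".split("\\W+").toList
// List("", "leading", "space")
```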

Update

For a given text file, consider this:

implicit class RichFile(val filename: String) extends AnyVal {

  def toDict() = {
    val words = io.Source.fromFile(filename).getLines.flatMap(_.split("\\W+")).toSeq
    words.distinct.zipWithIndex.toMap
  }

}

Use it like this:

"longTextFilename".toDict()
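
One thing the snippet above leaves open is closing the file. A possible variant, written as a plain function, closes the Source once the words have been read and skips empty tokens. The name fileToDict is hypothetical, and the strict toList is there so that everything is read before the file is closed.

```scala
import scala.io.Source

// Hypothetical helper: builds the dictionary and always closes the file
def fileToDict(filename: String): Map[String, Int] = {
  val source = Source.fromFile(filename)
  try {
    source.getLines()
      .flatMap(_.split("\\W+"))  // tokenize every line
      .filter(_.nonEmpty)        // drop empty tokens from leading separators
      .toList                    // force reading before the file is closed
      .distinct                  // keep the first occurrence of each word
      .zipWithIndex              // pair each word with its position
      .toMap
  } finally source.close()
}
```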

Answer 1 (score: 2)

A fold would certainly be more idiomatic, and you can use distinct so that each word is considered only once:

def addToDict(line: String) =
  line.split(' ').distinct.foldLeft((0, Map[String, Int]())){
    case ((i, m), s) => (i + 1, m + (s -> i))
  }._2

For example:

addToDict("a few words and another few words")
// Map(a -> 0, few -> 1, words -> 2, and -> 3, another -> 4)
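
Called once per line, this function starts numbering at 0 for every line. If a single dictionary over the whole text is wanted, as in the question, one possible sketch is to flatten all lines first and then run the same fold. The name buildDict and the sample input are illustrative only.

```scala
// Hypothetical helper: one dictionary across all lines
def buildDict(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split(' '))   // all words, in order of appearance
    .distinct                // keep the first occurrence of each word
    .foldLeft((0, Map.empty[String, Int])) {
      case ((i, m), s) => (i + 1, m + (s -> i))
    }._2

val dict = buildDict(Seq("a few words", "and another few words"))
// dict("a") == 0, dict("few") == 1, dict("another") == 4
```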

Answer 2 (score: 1)

For something like this, a mutable structure isn't necessary. I'd prefer something like this:

def addToDict(line: String): Map[String, Int] =
  line.split(' '). // 1. split words
  foldLeft(0 -> Map.empty[String, Int]) { (st, w) => // 2. fill the dict
    val (i, m): (Int, Map[String, Int]) = st // current state

    // determine next state...
    if (!m.contains(w)) {
      val j = i + 1 // new numeric id (ids start at 1)
      j -> (m + (w -> j)) // updated state
    } else i -> m // unchanged state
  }._2 // keep only the dictionary, drop the counter
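
Since each new word receives the next integer, the separate counter can be dropped by using the current size of the map as the next id. A minimal sketch under that assumption follows; the name toIds is made up, and ids here start at 0 rather than 1.

```scala
// Hypothetical variant: the map's size doubles as the next free id
def toIds(line: String): Map[String, Int] =
  line.split(' ').foldLeft(Map.empty[String, Int]) { (m, w) =>
    if (m.contains(w)) m      // word already known: state unchanged
    else m + (w -> m.size)    // new word: next id is the current size
  }

val ids = toIds("a few words and another few words")
// ids("a") == 0, ids("words") == 2, ids("another") == 4
```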