
Question

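I have a large text file (over 3 GB) of Word2Vec vectors built from Gigaword. Each line holds a word followed by its vector. The file is sorted by frequency, so high-frequency words appear earlier in the list than low-frequency ones.

For a given list of words, I need to build a Scala Map from each word to its word2vec vector. Here is my approach:

For each word, open the file as an iterator: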

val it = scala.io.Source.fromFile(filePath).getLines()
Use find to look for the matching word, falling back to a default value if it is not found:

val matched = it.find(_.split(" ").head == word).getOrElse("zzz 0.0")

Here is my full method:

def buildArray2b: (Double, Array[(String, breeze.linalg.DenseVector[Double])]) = {
val startAll = System.currentTimeMillis().toDouble
val stream = (for (word <- this.vocabulary.map(each => each.toLowerCase)) yield {
  println("starting " + word)
  val start = System.currentTimeMillis().toDouble
  println("building iterator")
  val iterator = Source.fromInputStream(this.inputStream).getLines()
  println("finding")
  val line = iterator.find(it => it.split(" ").head == word).getOrElse("zzz 0.0")
  println("found")
  val splitLine = line.split(" ")                                                                       //split string into elements
  val tail = splitLine.tail.map(_.toDouble)                                                             //build w2v vector
  val vectorizedLine = splitLine.head -> breeze.linalg.DenseVector(tail)                                //build map entry
  val stop = System.currentTimeMillis().toDouble
  println(word + ":" + (stop - start) / 1000d)
  vectorizedLine
}).toArray
val stopAll = System.currentTimeMillis().toDouble
val elapsed = (stopAll - startAll) / 1000d
(elapsed, stream)

}

Here is the output showing how long it took to find the words "a", "quixotic", and "the":

scala> w2v.buildArray2
a:0.001
quixotic:0.795
the:25.6

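I don't understand why it takes so little time to find "quixotic" (which should be much further down the list than "a" and "the"), yet takes forever to find the word "the".

I have very little experience with data structures, so I would appreciate (1) any insight into this problem, and (2) any suggestions on how to make this process more efficient.

To that end, I have already tried the following:

Loading the entire file into a Map. This takes a very long time, because the data is first converted to a sequence and then to a Map.
Converting the .txt file to .json and then using a package (json4s in this case) to open that .json file directly into a Map. I ran into memory errors (I have allocated 14g of memory to this project).

Thanks in advance for any comments/insights!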

Answer 1

Using a Map

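As long as you don't run into memory problems, a Map[String,Vector[String]] is a good choice. Read the file once and put the data into the Map. You are almost there already, because nearly any Seq[Tuple2] can easily be converted to a Map with toMap. Every key then gets constant time access.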
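A minimal sketch of that single pass, assuming each line has the form `word v1 v2 ...` separated by single spaces. The names `parseLine` and `loadVectors` and the sample lines are made up for illustration, and a plain `Vector[Double]` stands in for whatever vector type you actually use:

```scala
// Hypothetical helper: split one "word v1 v2 ..." line into a (word, vector) pair.
def parseLine(line: String): (String, Vector[Double]) = {
  val fields = line.split(" ")
  fields.head -> fields.tail.map(_.toDouble).toVector
}

// Hypothetical helper: consume the line iterator exactly once and build the Map.
def loadVectors(lines: Iterator[String]): Map[String, Vector[Double]] =
  lines.map(parseLine).toMap

// In-memory stand-in for Source.fromFile(filePath).getLines()
val sampleLines = Iterator("the 0.1 0.2", "a 0.3 0.4", "quixotic 0.5 0.6")
val vectors = loadVectors(sampleLines)

// Each lookup is now a hash lookup instead of a scan over the file.
val quixotic = vectors.get("quixotic")
val missing = vectors.get("zzz")
```

With the real file you would pass the iterator from `getLines()` instead of `sampleLines`. Whether the full Map fits in memory depends on the file and the JVM heap, so treat this as a starting point rather than a guaranteed fix.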

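Converting to json

This adds an extra layer of indirection. It only increases the amount of data that has to be parsed and processed, and makes the process slower.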

Iterators and reuse

Quoting the official Scala documentation (http://www.scala-lang.org/api/current/index.html#scala.collection.Iterator): "They have a hasNext method for checking if there is a next element available, and a next method which returns the next element and discards it from the iterator." So, by definition, an Iterator is not reusable.
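A small illustration of what "discards it from the iterator" means in practice, using made-up sample lines. The Scala documentation advises against using an iterator again after calling a method such as find on it, so this sketch only demonstrates the effect and is not a pattern to rely on:

```scala
// Made-up sample data in the same "word value" shape as the vector file.
val it = Iterator("the 0.1", "a 0.2", "quixotic 0.3")

// Finding "quixotic" consumes every element up to and including the match.
val first = it.find(_.startsWith("quixotic "))

// "the" was already consumed, so the same iterator can no longer find it.
val second = it.find(_.startsWith("the "))
```

Here `first` holds the matching line, while `second` is empty even though "the" was the very first element.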

Answer 2

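Scala's Iterator is stateful and is not meant to be shared or reused. Whether any sharing or reuse is possible depends entirely on the underlying resource being iterated. For example, you can safely iterate over a file multiple times, but reusing an iterator over a download stream makes no sense.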
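The sketch below contrasts opening a fresh stream for every search with sharing one stream across searches. It uses an in-memory byte array as a stand-in for the file, and `openStream` and `findWord` are hypothetical names introduced only for this example. The question's buildArray2b wraps the same this.inputStream for every word, so something like the shared case may be what causes the odd timings there, although that cannot be confirmed from the code shown:

```scala
import java.io.{ByteArrayInputStream, InputStream}
import scala.io.Source

// Made-up sample data standing in for the word2vec file.
val data = "the 0.1\na 0.2\nquixotic 0.3\n".getBytes("UTF-8")

// Hypothetical factory: every call opens a fresh stream positioned at the start.
def openStream(): InputStream = new ByteArrayInputStream(data)

// Hypothetical helper: scan the stream line by line for the given word.
def findWord(in: InputStream, word: String): Option[String] =
  Source.fromInputStream(in).getLines().find(_.split(" ").head == word)

// Fresh stream per search: each search starts from the first line.
val freshA = findWord(openStream(), "quixotic")
val freshB = findWord(openStream(), "the")

// Shared stream: the second search starts wherever the first one left the stream,
// so a word that appears earlier in the data is never seen again.
val shared = openStream()
val sharedA = findWord(shared, "quixotic")
val sharedB = findWord(shared, "the")
```

Both fresh searches succeed, while the second shared search comes back empty after reading to the end of the stream. Reopening the file per word avoids that, but it still rescans the file once per word, which is why a single pass is usually preferable.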

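You can do this in a single linear-time pass, without holding the whole file in memory, by converting vocabulary to a Map or Set. At a high level, you iterate over every line in the word2vec file and check whether the word is in your vocabulary. If it is, you add the word and its weights to a Map, like this: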

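import scala.io.Source
import breeze.linalg.DenseVector

// Set membership checks are effectively constant time
val vocab = this.vocabulary.toSet
val it = Source.fromInputStream(inputStream).getLines()
val result: Map[String, DenseVector[Double]] =
  it.foldLeft(Map.empty[String, DenseVector[Double]]) { (acc, line) =>
    val fields = line.split(" ")   // first field is the word, the rest is its vector
    val word = fields.head
    if (vocab.contains(word))
      acc + (word -> DenseVector(fields.tail.map(_.toDouble)))
    else
      acc
  }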

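foldLeft traverses the whole file starting from the left of the iterator (i.e. the head). On each line, it checks whether the word is a vocabulary word and, if so, adds it to the Map (acc). Once the whole iterator has been processed, it returns that Map.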

Most efficient way to repeatedly search a Scala iterator
