Question

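I'm new to Scala and would like to build a real-time application that matches people with one another. For a given person, I want the top 50 people with the highest matching score.

The idiom is as follows: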

val persons = new mutable.HashSet[Person]() // Collection of people
/* Feed omitted */
val personsPar = persons.par // Make it parallel
val person = ... // The given person

val res = personsPar
  .filter(...) // Some filters
  .map { p => (p, computeMatchingScoreAsFloat(person, p)) } // Pair each person with a score
  .toList
  .sortBy(-_._2) // Highest score first
  .take(50)
  .map(t => s"${t._1}=${t._2}").mkString("\n")

The sample code above uses a HashSet, but it could be any type of collection, since I'm fairly sure a HashSet is not optimal.

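The problem is that persons contains more than 5M elements, and the computeMatchingScoreAsFloat method computes a correlation value from two vectors of 200 floats. The whole computation takes about 2 seconds on my 6-core machine.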

My question is: what is the fastest way to implement this TOP N pattern in Scala?

Sub-questions:
- Which kind of collection (or other structure) should I use?
- Should I use Futures?

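Note: the computation must run in parallel. The pure computeMatchingScoreAsFloat computation alone (without ranking / TOP N) takes more than a second single-threaded, and < 200 ms multi-threaded on my computer.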

Edit: thanks to Guillaume, the computation time dropped from 2 seconds to 700 ms.

def top[B](n: Int, t: Traversable[B])(implicit ord: Ordering[B]): collection.mutable.PriorityQueue[B] = {

  // Reverse the ordering so the head is the smallest of the kept elements
  // (or the largest, when selecting minima).
  val starter = collection.mutable.PriorityQueue[B]()(ord.reverse)

  t.foldLeft(starter)(
    (myQueue, a) => {
      if (myQueue.length < n) { myQueue.enqueue(a); myQueue } // Still filling: keep at most n elements
      else if (myQueue.isEmpty || ord.compare(a, myQueue.head) < 0) myQueue // Not better than the weakest kept element
      else {
        myQueue.dequeue()
        myQueue.enqueue(a)
        myQueue
      }
    }
  )
}
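
Regarding the Futures sub-question, one possible way to keep the bounded queue while still computing scores in parallel is to split the data into chunks, take a local top N per chunk in its own Future, and then take the top N of the merged partial results. This is only a minimal sketch using the standard library: Person, score, the chunk count and the one-minute timeout are hypothetical stand-ins, not the real application code.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelTopN {
  // Hypothetical stand-ins for the real Person and computeMatchingScoreAsFloat.
  final case class Person(id: Int, features: Vector[Float])
  def score(a: Person, b: Person): Float =
    a.features.zip(b.features).map { case (x, y) => x * y }.sum

  // Keep the n largest elements in a bounded min-heap, returned largest first.
  def topN[A](n: Int, items: Iterator[A])(implicit ord: Ordering[A]): List[A] = {
    val heap = mutable.PriorityQueue[A]()(ord.reverse) // head = smallest kept element
    items.foreach { a =>
      if (heap.size < n) heap.enqueue(a)
      else if (heap.nonEmpty && ord.gt(a, heap.head)) { heap.dequeue(); heap.enqueue(a) }
    }
    heap.toList.sorted(ord.reverse)
  }

  // Orders (person, score) pairs by score only.
  val byScore: Ordering[(Person, Float)] = new Ordering[(Person, Float)] {
    def compare(a: (Person, Float), b: (Person, Float)): Int =
      java.lang.Float.compare(a._2, b._2)
  }

  def parallelTop(n: Int, target: Person, people: Vector[Person], chunks: Int): List[(Person, Float)] = {
    val chunkSize = math.max(1, (people.size + chunks - 1) / chunks)
    // One Future per chunk: score every person in the chunk and keep the local top n.
    val partials = people.grouped(chunkSize).toList.map { chunk =>
      Future(topN(n, chunk.iterator.map(p => (p, score(target, p))))(byScore))
    }
    // At most chunks * n candidates remain, so the final merge is cheap.
    val merged = Await.result(Future.sequence(partials), 1.minute)
    topN(n, merged.flatten.iterator)(byScore)
  }
}
```

The per-chunk queues never hold more than n elements, so no full sort of the 5M scores is needed; whether this beats a parallel collection would have to be measured.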

Thanks

Answer 1

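I would suggest a few changes:

1- I believe the filter and map steps traverse the collection twice. A lazy collection would reduce that to a single traversal. Either use a lazy collection (such as Stream) or convert yours to one; for a list, for instance: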

myList.view
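
To illustrate the difference, here is a small sketch (ViewDemo and trace are hypothetical names, and the data is a toy list) that logs the order in which the filter and map functions are called, with and without a view:

```scala
import scala.collection.mutable.ArrayBuffer

object ViewDemo {
  // Returns the computed list together with the order of filter/map invocations.
  def trace(useView: Boolean): (List[Int], List[String]) = {
    val log = ArrayBuffer.empty[String]
    val data = List(1, 2, 3, 4)
    val keepEven = (x: Int) => { log += s"filter $x"; x % 2 == 0 }
    val times10 = (x: Int) => { log += s"map $x"; x * 10 }
    val result =
      if (useView) data.view.filter(keepEven).map(times10).toList // lazy: nothing runs until toList
      else data.filter(keepEven).map(times10)                     // strict: builds an intermediate list
    (result, log.toList)
  }
}
```

With the strict version every filter call happens before the first map call, because filter builds a full intermediate list. With the view, each element flows through filter and then map before the next element is touched, so the data is traversed once.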

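2- The sort step sorts every element. Instead, you can do a foldLeft with an accumulator that stores the top N records. See an example implementation here: Simplest way to get the top n elements of a Scala Iterable. If you want maximum performance, I would probably test a priority queue rather than a list (this really falls in its wheelhouse). For example, something like this: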

// Lazily generates n pairs of random Ints as test data
def IntStream(n: Int): Stream[(Int, Int)] =
  if (n == 0) Stream.empty else (util.Random.nextInt, util.Random.nextInt) #:: IntStream(n - 1)

def top[B](n: Int, t: Traversable[B])(implicit ord: Ordering[B]): collection.mutable.PriorityQueue[B] = {

  // Reverse the ordering so the head is the smallest of the kept elements
  // (or the largest, when selecting minima).
  val starter = collection.mutable.PriorityQueue[B]()(ord.reverse)

  t.foldLeft(starter)(
    (myQueue, a) => {
      if (myQueue.length < n) { myQueue.enqueue(a); myQueue } // Still filling: keep at most n elements
      else if (myQueue.isEmpty || ord.compare(a, myQueue.head) < 0) myQueue // Not better than the weakest kept element
      else {
        myQueue.dequeue()
        myQueue.enqueue(a)
        myQueue
      }
    }
  )
}

def diff(t2: (Int, Int)) = t2._2 // Rank pairs by their second component
top(10, IntStream(10000))(Ordering.by(diff)) // select top 10

I think your problem calls for a SINGLE traversal of the collection, which should bring the run time below 1 second.
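
As a rough sketch of what that single traversal could look like for the people-matching case, the filter, the scoring and the top N selection can all hang off one iterator. Person, matchingScore and the example filter below are hypothetical stand-ins, since the real ones are not shown in the question:

```scala
import scala.collection.mutable

object SinglePassTop {
  // Hypothetical stand-ins for the question's Person and computeMatchingScoreAsFloat.
  final case class Person(name: String, age: Int)
  def matchingScore(a: Person, b: Person): Float = -math.abs(a.age - b.age).toFloat

  def best(n: Int, target: Person, people: Iterable[Person]): List[(Person, Float)] = {
    // Min-heap on the score: the head is the weakest candidate currently kept.
    val heap = mutable.PriorityQueue[(Person, Float)]()(
      Ordering.fromLessThan[(Person, Float)](_._2 > _._2))
    // The iterator pulls each element through filter, scoring and selection in one pass.
    people.iterator
      .filter(_.name != target.name) // example filter: skip the target itself
      .map(p => (p, matchingScore(target, p)))
      .foreach { scored =>
        if (heap.size < n) heap.enqueue(scored)
        else if (heap.nonEmpty && scored._2 > heap.head._2) {
          heap.dequeue()
          heap.enqueue(scored)
        }
      }
    heap.toList.sortWith(_._2 > _._2) // only n elements left to sort, best score first
  }
}
```

Only the final n survivors are sorted, so the cost of ranking no longer depends on sorting all 5M scores.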

Good luck!

What is the most efficient way to iterate over a collection in parallel in Scala (TOP N pattern)
