I'm trying to evaluate which data structure best represents a sparse vector in Scala. These sparse vectors hold a list of indices, each index carrying a value. I implemented a small benchmark, which seems to indicate that Array[(Long, Double)] takes much less space than two parallel arrays. Is that right? Did I do the benchmark correctly? (I wouldn't be surprised if I got something wrong somewhere.)
import java.lang.management.ManagementFactory
import java.text.NumberFormat

object TestSize {
  val N = 100000000
  val formatter: NumberFormat = java.text.NumberFormat.getIntegerInstance

  def twoParallelArrays(): Unit = {
    val Z1 = Array.ofDim[Long](N)
    val Z2 = Array.ofDim[Double](N)
    Z1(N - 1) = 1
    Z2(N - 1) = 1.0D
    println(Z2(N - 1) - Z1(N - 1))
    val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
    println(s"${formatter.format(z1)} ${formatter.format(z2)}")
  }

  def arrayOfTuples(): Unit = {
    val Z = Array.ofDim[(Long, Double)](N)
    Z(N - 1) = (1, 1.0D)
    println(Z(N - 1)._2 - Z(N - 1)._1)
    val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
    println(s"${formatter.format(z1)} ${formatter.format(z2)}")
  }

  def main(args: Array[String]): Unit = {
    // Comment out one or the other to look at the results
    //arrayOfTuples()
    twoParallelArrays()
  }
}
Answer (score: 5)
No, not correct.

`Array.ofDim[(Long, Double)](N)` creates a large array filled with `null`, and does not allocate any space for the `Long`s, the `Double`s, or the actual `Tuple2` instances; that's why you don't see anything in heap-memory usage. The two-array version allocates all the space it needs for all the `Long`s and `Double`s immediately, and you see it in heap-space usage.

Just replace `ofDim` with an appropriate `fill` to see the real numbers.
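To make the `ofDim` behavior concrete: for a reference element type it only allocates the array of references, all `null`, while for a primitive element type the backing storage is allocated and zero-initialized up front. A minimal sketch:

```scala
object OfDimDemo {
  def main(args: Array[String]): Unit = {
    // For a reference type such as a tuple, ofDim yields an array of nulls:
    // no Tuple2 instances exist yet, so almost no heap is consumed per element.
    val tuples = Array.ofDim[(Long, Double)](3)
    println(tuples(0) == null) // true

    // For a primitive type, the 8 bytes per element are allocated immediately.
    val longs = Array.ofDim[Long](3)
    println(longs(0)) // 0
  }
}
```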
On arrays of size N = 1000000:

    arrayOfTuples:     45,693,312  19,190,296
    twoParallelArrays: 25,925,792  19,315,256

The `arrayOfTuples` solution clearly takes more space.
You might wonder why the factor is roughly 1.8 instead of at least 2.5. This is because `Tuple2` is `@specialized` for a few primitive datatypes, in particular for `Long` and `Double`, so these two 8-byte primitives can be stored in a `Tuple2` without boxing. The total overhead is therefore only 8 bytes for the 64-bit reference from the array to each `Tuple2`, plus some per-instance overhead in each `Tuple2`. But it's still more than storing `Long`s and `Double`s directly in arrays.
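A back-of-envelope estimate of the 2.5 factor, assuming a 64-bit JVM with roughly 16-byte object headers (the exact header size is JVM-dependent, e.g. with compressed oops, so these are rough numbers, not a measurement):

```scala
object MemoryEstimate {
  // Two parallel arrays: 8 bytes per Long plus 8 bytes per Double.
  def parallelBytes(n: Long): Long = n * (8 + 8)

  // Array of specialized Tuple2: an 8-byte reference per element, plus
  // each Tuple2 instance (~16-byte header + one Long field + one Double field).
  def tupleBytes(n: Long): Long = n * 8 + n * (16 + 8 + 8)

  def main(args: Array[String]): Unit = {
    val n = 1000000L
    println(parallelBytes(n))                          // 16000000
    println(tupleBytes(n))                             // 40000000
    println(tupleBytes(n).toDouble / parallelBytes(n)) // 2.5
  }
}
```

The measured ratio comes out below 2.5 because both heap readings also include a fixed baseline of memory already in use by the JVM.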
By the way: that's exactly the reason why Apache Spark stores data using all those `Encoder`s, so that you don't have to worry about the overhead caused by your tuples and case classes.
Full code:
import java.lang.management.ManagementFactory
import java.text.NumberFormat

object TestSize {
  val N = 1000000
  val formatter: NumberFormat = java.text.NumberFormat.getIntegerInstance

  def twoParallelArrays(): Unit = {
    val Z1 = Array.fill[Long](N)(42L)
    val Z2 = Array.fill[Double](N)(42.0)
    println(Z1)
    println(Z2)
    Z1(N - 1) = 1
    Z2(N - 1) = 1.0D
    println(Z2(N - 1) - Z1(N - 1))
    val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
    Z1((new scala.util.Random).nextInt(N)) = 1234L
    Z2((new scala.util.Random).nextInt(N)) = 345.0d
    println(Z2(N - 1) - Z1(N - 1))
    println(s"${formatter.format(z1)} ${formatter.format(z2)}")
  }

  def arrayOfTuples(): Unit = {
    val Z = Array.fill[(Long, Double)](N)((42L, 42.0d))
    Z(N - 1) = (1, 1.0D)
    println(Z(N - 1)._2 - Z(N - 1)._1)
    val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
    Z((new scala.util.Random).nextInt(N)) = (1234L, 543.0d)
    println(Z(N - 1)._2 - Z(N - 1)._1)
    println(s"${formatter.format(z1)} ${formatter.format(z2)}")
  }

  def main(args: Array[String]): Unit = {
    // Comment out one or the other to look at the results
    arrayOfTuples()
    // twoParallelArrays()
  }
}
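As a usage sketch of the parallel-array layout the question started from (the `SparseVec` name and `dot` helper are hypothetical, not from the original post): with indices kept sorted, a dot product of two sparse vectors is a single merge-style walk over the index arrays:

```scala
// Hypothetical sparse-vector sketch using the two-parallel-array layout.
// Invariant assumed here: indices are sorted in ascending order.
final case class SparseVec(indices: Array[Long], values: Array[Double])

object SparseDot {
  def dot(a: SparseVec, b: SparseVec): Double = {
    var i = 0; var j = 0; var sum = 0.0
    while (i < a.indices.length && j < b.indices.length) {
      if (a.indices(i) == b.indices(j)) {
        // Matching index: multiply the values and advance both cursors.
        sum += a.values(i) * b.values(j); i += 1; j += 1
      } else if (a.indices(i) < b.indices(j)) i += 1
      else j += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val a = SparseVec(Array(1L, 5L, 9L), Array(2.0, 3.0, 4.0))
    val b = SparseVec(Array(5L, 9L, 10L), Array(10.0, 1.0, 7.0))
    // Common indices 5 and 9: 3.0 * 10.0 + 4.0 * 1.0
    println(dot(a, b)) // 34.0
  }
}
```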