Question

I'm trying to download a file using the Scala API, but I'd like the download to abort if the file is too large (50MB).

I've managed to put together a very inefficient approach that works for small files (< 10KB), but my CPU goes through the roof on large files:

var size = 0
val bytes = scala.io.Source.fromURL(url)(scala.io.Codec.ISO8859).toStream.map {
  c =>
    size = size + 1
    if (size > (maxMbSize*1024*1024)) {
      throw new Exception(s"File size is greater than the maximum allowed size of $maxMbSize MB")
    }
    c.toByte
}.toArray

I'd like to perform this check more efficiently and avoid using a var for the size. Is this possible?

Also, I'm using the Play Framework, in case anyone knows of an API in that framework that does what I'm looking for.

Answer 1

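You don't need to load the data into a byte array at all: you can generate the hash on the fly with DigestInputStream from the standard Java library. In this example I load the data from a String, but you can adapt it to load from a URL. A tail-recursive function eliminates the var, and the function returns an Option so that an oversized file can be signalled by returning None.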

import java.io._
import java.security._
import scala.annotation.tailrec

def calculateHash(algorithm: MessageDigest, in: String, limit: Int): Option[Array[Byte]] = {

  val input = new ByteArrayInputStream(in.getBytes())
  val dis = new DigestInputStream(input, algorithm)

  @tailrec
  def read(total: Int): Option[Array[Byte]] = {
    if (total > limit) None
    else {
      val byte = dis.read()
      if (byte == -1) Some(algorithm.digest())
      else read(total + 1)
    }
  }
  read(0)
}

Usage example:

val sha1 = MessageDigest.getInstance("SHA1") 

calculateHash(sha1, "Hello", 5).get             

//> res0: Array[Byte] = Array(-9, -1, -98, -117, 123, -78, -32, -101, 112, -109, 90, 93, 120, 94, 12, -59, -39, -48, -85, -16)

calculateHash(sha1, "Too long!!!", 5)           

//> res1: Option[Array[Byte]] = None

You can also get better performance by using the variant of DigestInputStream.read() that takes a buffer:

...
val buffer = new Array[Byte](1024)

@tailrec
def read(total: Int): Option[Array[Byte]] = {
  if (total > limit) None
  else {
    val count = dis.read(buffer, 0, buffer.length)
    if (count == -1) Some(algorithm.digest())
    else read(total + count)
  }
}
....
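
As a rough sketch of the "adapt it to load from a URL" step, the buffered version can take any InputStream instead of a String. The names hashStream and hashUrl, the 8192-byte buffer size and the Long limit are my own choices for illustration, not part of the code above. I also reset the digest before returning None so the same MessageDigest instance can be reused afterwards.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.security.{DigestInputStream, MessageDigest}
import scala.annotation.tailrec

// Hypothetical helper: hashes any InputStream, giving up once more than `limit` bytes have been read.
def hashStream(algorithm: MessageDigest, in: InputStream, limit: Long): Option[Array[Byte]] = {
  val dis = new DigestInputStream(in, algorithm)
  val buffer = new Array[Byte](8192) // buffer size is an arbitrary choice

  @tailrec
  def read(total: Long): Option[Array[Byte]] =
    if (total > limit) {
      algorithm.reset() // discard the partial hash so the digest can be reused
      None
    } else {
      val count = dis.read(buffer, 0, buffer.length)
      if (count == -1) Some(algorithm.digest())
      else read(total + count)
    }

  // Close the stream whether hashing finished, was cut short, or threw.
  try read(0L) finally dis.close()
}

// Hypothetical wrapper for the URL case; the URL is whatever the caller supplies.
def hashUrl(algorithm: MessageDigest, url: String, limit: Long): Option[Array[Byte]] =
  hashStream(algorithm, new java.net.URL(url).openStream(), limit)
```

Because the limit is only checked between reads, up to one buffer's worth of extra data may be pulled from the stream before it gives up.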

Answer 2

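Since you are materializing all of the data in memory anyway, you don't need to buffer it in small chunks (which is what the scala io library is doing). Also, since you want bytes, there is no need to decode the bytes into characters only to reverse the process.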

To get rid of your size var, you can use the zipWithIndex function, which pairs each element with its index. Note that the index starts at 0, so you need to add 1.

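def readMyURL(url: String): Array[Byte] = {
    // maxMbSize is assumed to be defined elsewhere, as in the question
    val maxBytes = maxMbSize * 1024L * 1024L
    val is = new java.net.URL(url).openStream()
    try {
        Iterator.continually(is.read).zipWithIndex.takeWhile {
            case (b, index) =>
                // index is 0-based, so a real byte at this index is byte number index + 1
                if (b != -1 && index + 1 > maxBytes) {
                    throw new Exception(s"File size is greater than the maximum allowed size of $maxMbSize MB")
                }
                b != -1  // -1 is the end of stream
        }.map(_._1.toByte).toArray
    } finally {
        is.close()  // close the stream even if the size check throws
    }
}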

This is lazy, so the iterator is not traversed until you call toArray.
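
A small demonstration of that laziness, using a counter in place of the real stream (the counter and the value names are only for illustration):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Counts how many times the "stream" has actually been read.
val reads = new AtomicInteger(0)

// Building the pipeline does not pull any elements yet.
val pending = Iterator.continually(reads.incrementAndGet())
  .zipWithIndex
  .takeWhile { case (value, _) => value <= 3 }
  .map(_._1)

val readsBeforeToArray = reads.get() // 0: nothing has been read so far

// toArray drives the iterator: it reads 1, 2, 3 and then 4, which fails the predicate.
val result = pending.toArray
val readsAfterToArray = reads.get()  // 4
```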

You can probably get away with not closing that URL InputStream (it looks like the scala io library doesn't close it either).
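
If the goal is the byte array itself rather than a hash, one possible variation is to read in chunks into a ByteArrayOutputStream and stop as soon as the limit would be exceeded. This is only a sketch: readAtMost, the 8192-byte buffer and the choice to return an Option instead of throwing are my assumptions, not something from the question.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import scala.annotation.tailrec

// Hypothetical helper: returns Some(bytes) if the stream holds at most `maxBytes` bytes, otherwise None.
def readAtMost(in: InputStream, maxBytes: Int): Option[Array[Byte]] = {
  val buffer = new Array[Byte](8192) // buffer size is an arbitrary choice
  val out = new ByteArrayOutputStream()

  @tailrec
  def loop(): Option[Array[Byte]] = {
    val count = in.read(buffer, 0, buffer.length)
    if (count == -1) Some(out.toByteArray)                   // end of stream, within the limit
    else if (out.size().toLong + count > maxBytes) None      // this chunk would exceed the limit
    else {
      out.write(buffer, 0, count)
      loop()
    }
  }

  // Close the stream whether the read finished, was cut short, or threw.
  try loop() finally in.close()
}
```

For the URL case you would pass in new java.net.URL(url).openStream(). Reading a chunk at a time avoids one read call per byte, and no var is needed because the running size lives in the output stream.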

Abort download if file is too large in Scala
