Question

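I want to download large files from Google Cloud Storage using the Java library com.google.cloud.storage provided by Google. I have working code, but I still have one minor question and one major concern:

My major concern is: when is the file content actually downloaded? Referring to the code below, is it during storage.get(blobId), during blob.reader(), or during reader.read(bytes)? This matters a lot for handling an invalid checksum: what do I need to do to actually trigger fetching the file over the network again?

The simpler question is: does the Google library have built-in functionality to run an md5 (or crc32c) check on the received file? Maybe I don't need to implement it myself.

Here is my method for downloading large files from Google Cloud Storage: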

private static final int MAX_NUMBER_OF_TRIES = 3;
public Path downloadFile(String storageFileName, String bucketName) throws IOException {
    // In my real code, this is a field populated in the constructor.
    Storage storage = Objects.requireNonNull(StorageOptions.getDefaultInstance().getService());

    BlobId blobId = BlobId.of(bucketName, storageFileName);
    Path outputFile = Paths.get(storageFileName.replaceAll("/", "-"));
    int retryCounter = 1;
    Blob blob;
    boolean checksumOk;
    MessageDigest messageDigest;
    try {
        messageDigest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException ex) {
        throw new RuntimeException(ex);
    }

    do {
        LOGGER.debug("Start download file {} from bucket {} to Content Store (try {})", storageFileName, bucketName, retryCounter);
        blob = storage.get(blobId);
        if (null == blob) {
            throw new CloudStorageCommunicationException("Failed to download file after " + retryCounter + " tries.");
        }
        if (Files.exists(outputFile)) {
            Files.delete(outputFile);
        }
        try (ReadChannel reader = blob.reader();
             FileChannel channel = new FileOutputStream(outputFile.toFile(), true).getChannel()) {
            ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
            int bytesRead = reader.read(bytes);
            while (bytesRead > 0) {
                bytes.flip();
                messageDigest.update(bytes.array(), 0, bytesRead);
                channel.write(bytes);
                bytes.clear();
                bytesRead = reader.read(bytes);
            }
        }
        String checksum = Base64.encodeBase64String(messageDigest.digest());
        checksumOk = checksum.equals(blob.getMd5());
        if (!checksumOk) {
            Files.delete(outputFile);
            messageDigest.reset();
        }
    } while (++retryCounter <= MAX_NUMBER_OF_TRIES && !checksumOk);
    if (!checksumOk) {
        throw new CloudStorageCommunicationException("Failed to download file after " + MAX_NUMBER_OF_TRIES + " tries.");
    }
    return outputFile;
}

Answer 1

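The google-cloud-java storage library does not validate checksums on its own when reading data, beyond the normal HTTPS/TCP correctness checks. If it compared the MD5 of the received data against the known MD5, it would have to download the entire file before it could return any result from read(), which is not feasible for very large files.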

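If you need the additional protection of comparing MD5s, what you are doing is a good idea. If this is a one-off task, you could use the gsutil command-line tool, which performs the same kind of additional check.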
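On the crc32c option raised in the question: as far as I know, Blob.getCrc32c() returns the checksum as Base64 over the four big-endian bytes of the CRC32C value, so a local comparison needs the same encoding. The following is a minimal sketch using only the JDK (java.util.zip.CRC32C requires Java 9 or later). The class and method names are made up for illustration and are not part of the Google library.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.zip.CRC32C;

// Hypothetical helper, not part of google-cloud-java.
final class Crc32cCheck {
    /** Returns the CRC32C of data, Base64-encoded over its 4 big-endian bytes. */
    static String base64Crc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        // ByteBuffer is big-endian by default, so putInt yields the byte order we want.
        byte[] bigEndian = ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
        return Base64.getEncoder().encodeToString(bigEndian);
    }
}
```

For a large file you would call crc.update(...) once per chunk inside the read loop instead of passing the whole content at once, then compare the encoded result with blob.getCrc32c().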

Answer 2

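As the JavaDoc of ReadChannel says:

Implementations of this class may buffer data internally to reduce remote calls.

So the implementation you get from blob.reader() could cache the whole file, some bytes, or nothing at all and simply fetch bytes when you call read(). You will never know, and you shouldn't care.

Since only read() throws an IOException and the other methods you use do not, I'd say that only calling read() actually downloads content. You can also see this in the library's sources.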

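By the way, despite the example in the library's JavaDocs, you should check for >= 0, not > 0. A return value of 0 only means that nothing was read, not that the end of the stream has been reached. The end of the stream is signaled by returning -1.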
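A minimal sketch of such a loop, written against the JDK's ReadableByteChannel interface (which ReadChannel extends) so that it runs without the Google library. The class and method names are made up for illustration. It also loops on write(), since a channel is allowed to write fewer bytes than the buffer holds.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.security.MessageDigest;

// Hypothetical helper, not part of google-cloud-java.
final class ChannelCopy {
    /** Copies until read() returns -1, feeding each chunk to the digest; returns the bytes copied. */
    static long copyAndDigest(ReadableByteChannel in, WritableByteChannel out, MessageDigest digest)
            throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(128 * 1024);
        long total = 0;
        int bytesRead;
        // >= 0 rather than > 0: a read of 0 bytes is not end of stream, only -1 is.
        while ((bytesRead = in.read(buffer)) >= 0) {
            buffer.flip();
            digest.update(buffer.array(), 0, bytesRead);
            while (buffer.hasRemaining()) {
                out.write(buffer); // may be a partial write, so keep going until drained
            }
            buffer.clear();
            total += bytesRead;
        }
        return total;
    }
}
```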

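To retry after a failed checksum check, get a new reader from the blob. If anything caches the downloaded data, it is the reader itself. So if you get a new reader from the blob, the file will be downloaded from the remote again.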
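The retry could be sketched roughly as follows. The Supplier stands in for the blob and is an assumption made so the example runs with the JDK alone, and RetryingDownload is a made-up name. Each attempt asks the supplier for a fresh channel, uses a fresh MessageDigest, and truncates the target file before writing.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.function.Supplier;

// Hypothetical helper, not part of google-cloud-java.
final class RetryingDownload {
    /** Returns true once an attempt matches expectedMd5Base64; false after maxTries failures. */
    static boolean download(Supplier<ReadableByteChannel> readerSource, Path target,
                            String expectedMd5Base64, int maxTries)
            throws IOException, NoSuchAlgorithmException {
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            MessageDigest digest = MessageDigest.getInstance("MD5"); // fresh state per attempt
            try (ReadableByteChannel reader = readerSource.get();    // fresh reader per attempt
                 FileChannel out = FileChannel.open(target, StandardOpenOption.CREATE,
                         StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
                ByteBuffer buffer = ByteBuffer.allocate(128 * 1024);
                int bytesRead;
                while ((bytesRead = reader.read(buffer)) >= 0) {
                    buffer.flip();
                    digest.update(buffer.array(), 0, bytesRead);
                    while (buffer.hasRemaining()) {
                        out.write(buffer);
                    }
                    buffer.clear();
                }
            }
            String actual = Base64.getEncoder().encodeToString(digest.digest());
            if (actual.equals(expectedMd5Base64)) {
                return true;
            }
        }
        Files.deleteIfExists(target); // do not leave a corrupt file behind
        return false;
    }
}
```

With the code from the question, the call would presumably look like RetryingDownload.download(() -> blob.reader(), outputFile, blob.getMd5(), MAX_NUMBER_OF_TRIES).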

How to download a large file from Google Cloud Storage using Java with checksum control
