Question

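We have a requirement to extract large .zip files (about 3-4 GB) from one blob container into another blob container. The extracted files are JSON files (about 35-50 GB).

For the implementation we used the code from https://msdevzone.wordpress.com/2017/07/07/extract-a-zip-file-stored-in-azure-blob/ as a reference. It extracts a 40 MB zip file to 400 MB within a few minutes, but it hangs for more than an hour when extracting a 2 GB file to a 30 GB JSON file.

Can anyone suggest a better solution for this scenario that does not use file operations?

The code we are working with is shown below for reference: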

CloudBlockBlob blockBlob = container.GetBlockBlobReference(filename);
BlobRequestOptions options = new BlobRequestOptions();
options.ServerTimeout = new TimeSpan(0, 20, 0);

// Save blob(zip file) contents to a Memory Stream.
using (MemoryStream zipBlobFileStream = new MemoryStream())
{
    //blockBlob.Properties.LeaseDuration
    blockBlob.DownloadToStream(zipBlobFileStream, null, options);
    zipBlobFileStream.Flush();
    zipBlobFileStream.Position = 0;
    //use ZipArchive from System.IO.Compression to extract all the files from zip file
    using (ZipArchive zip = new ZipArchive(zipBlobFileStream, ZipArchiveMode.Read, true))
    {
        //Each entry here represents an individual file or a folder
        foreach (var entry in zip.Entries)
        {
            //creating an empty file (blockBlob) for the actual file with the same name of file
            var blob = extractcontainer.GetBlockBlobReference(entry.FullName);
            using (var stream = entry.Open())
            {
                //check for file or folder and update the above blob reference with actual content from stream
                if (entry.Length > 0)
                    blob.UploadFromStream(stream);
            }
        }
    }
}

Answer 1

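The approach you referenced will not work for files this large because it uses a memory stream: the following line loads the entire blob into memory, which leads to an out-of-memory condition.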

blob.DownloadToStream(memoryStream);

To solve this, I followed the instructions at https://discord.js.org/#/docs/main/stable/class/Client. The only change I made to the code was adding await on this line:

await blockBlob.UploadFromStreamAsync(fileStream);

Hope this helps.

Answer 2

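Use Azure Storage file shares. This was the only way that worked for me without loading the entire ZIP into memory. I tested with a 3 GB ZIP file (containing thousands of files, or a single large file), and memory/CPU usage stayed low and stable. Maybe you can adapt it to BlockBlob. Hope it helps!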

var zipFiles = _directory.ListFilesAndDirectories()
    .OfType<CloudFile>()
    .Where(x => x.Name.ToLower().Contains(".zip"))
    .ToList();

foreach (var zipFile in zipFiles)
{
    using (var zipArchive = new ZipArchive(zipFile.OpenRead()))
    {
        foreach (var entry in zipArchive.Entries)
        {
            if (entry.Length > 0)
            {
                CloudFile extractedFile = _directory.GetFileReference(entry.Name);

                using (var entryStream = entry.Open())
                {
                    byte[] buffer = new byte[16 * 1024];
                    using (var ms = extractedFile.OpenWrite(entry.Length))
                    {
                        int read;
                        while ((read = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                        {
                            ms.Write(buffer, 0, read);
                        }
                    }
                }
            }
        }
    }               
}
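
Adapting the loop above to block blobs might look like the following minimal sketch. It keeps the same idea (open the archive over a seekable stream and copy each entry in chunks) but depends only on System.IO.Compression, so the storage side is passed in by the caller. ZipStreamExtractor, ExtractTo, openDestination and the 4 MB buffer size are names and values assumed for illustration; they are not part of the original answer or of any SDK.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ZipStreamExtractor
{
    // Copies every file entry of a zip archive to a destination stream
    // supplied by the caller, one buffer at a time.
    //
    // zipSource should be seekable: in Read mode ZipArchive copies a
    // non-seekable stream into memory first, which is what we want to avoid.
    // openDestination is a hypothetical callback that returns a writable
    // stream for a given entry name (for example a blob write stream).
    public static int ExtractTo(
        Stream zipSource,
        Func<string, Stream> openDestination,
        int bufferSize = 4 * 1024 * 1024)
    {
        int extracted = 0;

        using (var archive = new ZipArchive(zipSource, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                // Directory entries end with '/' and carry no data.
                if (entry.FullName.EndsWith("/", StringComparison.Ordinal))
                {
                    continue;
                }

                using (Stream source = entry.Open())
                using (Stream destination = openDestination(entry.FullName))
                {
                    // Only bufferSize bytes of entry data are held at a time.
                    source.CopyTo(destination, bufferSize);
                }

                extracted++;
            }
        }

        return extracted;
    }
}
```

With the Microsoft.WindowsAzure.Storage SDK used in the question, this could presumably be wired up as ZipStreamExtractor.ExtractTo(blockBlob.OpenRead(), name => extractcontainer.GetBlockBlobReference(name).OpenWrite()), assuming OpenRead returns a seekable stream so that ZipArchive does not buffer the whole archive. Unlike the entry.Length > 0 check above, the sketch skips only directory entries, so empty files are still created.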

Answer 3

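If you need to unzip a large number of files in Azure Storage, one option is Azure Batch.

Azure Batch enables you to run large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud.

It manages the compute cluster for you; all you need to worry about is writing the logic and submitting it to the Batch service so that it runs across the nodes.

You can download the blob as a stream, unzip it with the ZipArchive class, and then upload the result to the output container.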

using (Stream memoryStream = new MemoryStream())
{
    blob.DownloadToStream(memoryStream);
    memoryStream.Position = 0; //Reset the stream

    using (ZipArchive archive = new ZipArchive(memoryStream))
    {
        Console.WriteLine("Extracting {0} which contains {1} files", blobName, archive.Entries.Count);
        foreach (ZipArchiveEntry entry in archive.Entries)
        {
            CloudBlockBlob blockBlob = outputContainer.GetBlockBlobReference(entry.Name);

            // Dispose the entry stream once the upload completes
            using (Stream entryStream = entry.Open())
            {
                blockBlob.UploadFromStream(entryStream);
            }
            Console.WriteLine("Uploaded {0}", entry.Name);
        }
    }
}

For the detailed code, see this sample.

How to unzip large zip files between blob containers without using a physical file path
