Question

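While looking for ideas I found https://stackoverflow.com/a/54222447/264822, which handles zip files and which I think is a really clever solution. But it relies on zip files having a Central Directory, and tar files don't have one.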

I thought I could follow the same general principle and expose the S3 file to tarfile through the fileobj argument:

import io
import tarfile

import boto3
from botocore.exceptions import ClientError

class S3File(io.BytesIO):
    """File-like object that serves each read with a ranged GET from S3."""

    def __init__(self, bucket_name, key_name, s3client):
        super().__init__()
        self.bucket_name = bucket_name
        self.key_name = key_name
        self.s3client = s3client
        self.offset = 0

    def close(self):
        # Nothing to release: every read is an independent request.
        return

    def read(self, size):
        print('read: offset = {}, size = {}'.format(self.offset, size))
        start = self.offset
        end = self.offset + size - 1
        try:
            s3_object = self.s3client.get_object(Bucket=self.bucket_name, Key=self.key_name, Range="bytes=%d-%d" % (start, end))
        except ClientError:
            # For example, the range starts past the end of the object: report EOF.
            return b''
        result = s3_object['Body'].read()
        # Advance by what was actually returned, which can be less than size at EOF.
        self.offset += len(result)
        return result

    def seek(self, offset, whence=0):
        # Only absolute seeks (whence=0) are supported.
        if whence == 0:
            print('seek: offset {} -> {}'.format(self.offset, offset))
            self.offset = offset
        return self.offset

    def tell(self):
        return self.offset

# bucket_name, file_name and s3client are assumed to be defined already.
s3file = S3File(bucket_name, file_name, s3client)
tarf = tarfile.open(fileobj=s3file)
names = tarf.getnames()
for name in names:
    print(name)

This works, except that the output looks like this:

read: offset = 0, size = 2
read: offset = 2, size = 8
read: offset = 10, size = 8192
read: offset = 8202, size = 1235
read: offset = 9437, size = 1563
read: offset = 11000, size = 3286
read: offset = 14286, size = 519
read: offset = 14805, size = 625
read: offset = 15430, size = 1128
read: offset = 16558, size = 519
read: offset = 17077, size = 573
read: offset = 17650, size = 620
(continued)

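tarfile just reads the whole file anyway, so I haven't gained anything. Is there any way to make tarfile read only the parts of the file it needs? The only alternative I can think of is to re-implement the tar file parsing so that it:

Reads the 512-byte header and writes it to a BytesIO buffer.
Gets the size of the file that follows and writes zeros to the BytesIO buffer.
Skips over the file to the next header.

But that seems too complicated.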
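For illustration, the header-walking idea in the steps above can be sketched in a few lines. This is a minimal sketch, not a full tar parser: `list_tar_names` and the `read_range(offset, size)` callable are hypothetical names (the callable would wrap an S3 ranged GET), and the code assumes a plain, uncompressed ustar-style archive with no GNU long names, pax extended headers, or members larger than 8 GB.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data padding are both aligned to 512 bytes


def list_tar_names(read_range):
    """Yield (name, size) for each member, reading only the headers.

    read_range(offset, size) is a hypothetical callable that returns up to
    `size` bytes starting at `offset`, e.g. a wrapper around a ranged GET.
    """
    offset = 0
    while True:
        header = read_range(offset, BLOCK)
        # A short read or an all-zero block marks the end of the archive.
        if len(header) < BLOCK or header.count(b"\0") == BLOCK:
            return
        # Name is the first 100 bytes, NUL-terminated.
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        # Size is an octal string stored in bytes 124..135.
        size = int(header[124:136].strip(b"\0 ") or b"0", 8)
        yield name, size
        # Skip this header plus the data, rounded up to a whole block.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
```

With an in-memory archive, `read_range` can simply be `lambda off, size: raw[off:off + size]`; against S3 it would issue one 512-byte request per member.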

Answer 1

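My mistake. I'm actually dealing with tar.gz files, but I assumed zip and tar.gz were similar. They aren't: tar is an archive file that is then compressed with gzip, so to read the tar you have to decompress it first. My idea of pulling bits out of the tar file won't work.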

What does work is downloading the whole object first and then opening it with tarfile, which takes care of the gzip decompression.
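The download-then-open approach this answer describes can be sketched as follows. This is an assumption about the shape of such code, not the answerer's exact snippet: the function names are hypothetical, `s3client` is expected to be a boto3 S3 client, and the whole object is assumed to fit in memory.

```python
import io
import tarfile


def list_names_from_bytes(data):
    """Return the member names of a tar or tar.gz held fully in memory."""
    # mode="r:*" lets tarfile detect gzip/bz2/xz compression transparently.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tarf:
        return tarf.getnames()


def list_names_from_s3(s3client, bucket_name, key_name):
    """Download the whole object, then list it (assumes it fits in memory)."""
    s3_object = s3client.get_object(Bucket=bucket_name, Key=key_name)
    return list_names_from_bytes(s3_object["Body"].read())
```

Because gzip is a stream format, there is no way to jump to an arbitrary member of a tar.gz without decompressing everything before it, which is why ranged reads do not help here.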

I suspect the original code would work for a plain tar file, but I don't have any to try it on.

Answer 2

I just tested your original code on a tar file and it works well.

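Here is my sample output (truncated). I made some small changes to show the total bytes downloaded and the seek step size in kB (posted in this gist). This is for a 1 GB tar file containing 321 files (average size 3 MB per file):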

read: offset = 0, size = 2, total download = 2
seek: offset 2 -> 0 (diff = -1 kB)
read: offset = 0, size = 8192, total download = 8194
seek: offset 8192 -> 0 (diff = -9 kB)
read: offset = 0, size = 8192, total download = 16386
seek: offset 8192 -> 0 (diff = -9 kB)
read: offset = 0, size = 512, total download = 16898
<TarInfo 'yt.txt' at 0x7fbbed639ef0>
seek: offset 512 -> 7167 (diff = 6 kB)
read: offset = 7167, size = 1, total download = 16899
read: offset = 7168, size = 512, total download = 17411
<TarInfo 'yt_cache/youtube-sigfuncs' at 0x7fbbed639e20>
read: offset = 7680, size = 512, total download = 17923

...

<TarInfo 'yt_vids/whistle_dolphins-SZTC_zT9ijg.m4a' at 0x7fbbecc697a0>
seek: offset 1004473856 -> 1005401599 (diff = 927 kB)
read: offset = 1005401599, size = 1, total download = 211778
read: offset = 1005401600, size = 512, total download = 212290
None
322

So this downloads 212 kB of a 1 GB tar file to get the list of 321 file names, and it takes about 2 minutes on Colab and about 1.5 minutes on EC2 in the same region as the bucket.

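By comparison, downloading the full file takes 17 seconds on Colab, and listing its files with tar -tf file.tar takes 1 second. So if I were optimizing for execution time, I would rather download the full file and process it locally. Otherwise, maybe some optimizations are possible in your original code? IDK.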
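One possible optimization, offered as an untested assumption rather than something measured here: in the log above, each member costs a 1-byte read immediately followed by a 512-byte header read, so serving reads from larger cached chunks would merge those into a single request. The sketch below uses a hypothetical `CachedRangeReader` around a hypothetical `read_range(offset, size)` callable; it trades fewer requests for more downloaded bytes and an unbounded in-memory cache.

```python
class CachedRangeReader:
    """Serve reads from fixed-size chunks fetched on demand and kept in memory."""

    def __init__(self, read_range, chunk_size=64 * 1024):
        self.read_range = read_range  # hypothetical callable: (offset, size) -> bytes
        self.chunk_size = chunk_size
        self.chunks = {}              # chunk index -> bytes (never evicted in this sketch)
        self.requests = 0             # number of underlying fetches, for comparison

    def _chunk(self, index):
        if index not in self.chunks:
            self.requests += 1
            self.chunks[index] = self.read_range(index * self.chunk_size, self.chunk_size)
        return self.chunks[index]

    def read(self, offset, size):
        out = bytearray()
        while size > 0:
            chunk = self._chunk(offset // self.chunk_size)
            start = offset % self.chunk_size
            piece = chunk[start:start + size]
            if not piece:
                break  # past the end of the object
            out += piece
            offset += len(piece)
            size -= len(piece)
        return bytes(out)
```

A file-like wrapper such as the question's S3File could call `reader.read(self.offset, size)` instead of issuing its own request. Whether this actually shortens the 2-minute listing depends on request latency versus bandwidth, which would need measuring.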

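OTOH, fetching a single file is more efficient than the 2 minutes above if it sits at the beginning of the tar, but just as slow as getting all the file names if it sits at the end. I couldn't do this with the getmember() function, because it seems to call getmembers() internally, which has to go through the whole file. Instead, I rolled my own while loop to find the file and abort the search once it is found: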

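import boto3
import tarfile

bucket_name, file_name = "bucket", "file.tar"
name_search = "path/to/member"  # hypothetical: name of the member to find

s3client = boto3.client("s3")
s3file = S3File(bucket_name, file_name, s3client)  # S3File as defined in the question

with tarfile.open(mode="r", fileobj=s3file) as tarf:
    while True:
        tarinfo = tarf.next()  # returns None at the end of the archive
        # Stop at the first match, or when the archive is exhausted.
        if tarinfo is None or tarinfo.name == name_search:
            break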

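I think the way forward would be to have tarfile.open(...) cache the offset of each file, so that later tarfile.open(...) calls don't traverse the whole file again. Once that is done, a first pass through the tar file would make it possible to download individual files from a tar in S3 without traversing the full file over and over to reach them.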
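The caching idea can be approximated outside tarfile today, since each TarInfo exposes offset_data and size. The following is a minimal sketch under stated assumptions: `build_index` and `read_member` are hypothetical helpers, `read_range(offset, size)` is a hypothetical callable wrapping a ranged GET, and the archive is an uncompressed tar.

```python
import io
import tarfile


def build_index(fileobj):
    """Walk the tar once and record where each regular file's data lives."""
    index = {}
    # mode="r:" opens uncompressed archives only, which is what ranged reads need.
    with tarfile.open(fileobj=fileobj, mode="r:") as tarf:
        for info in tarf:
            if info.isfile():
                index[info.name] = (info.offset_data, info.size)
    return index


def read_member(read_range, index, name):
    """Fetch one member's bytes directly, without walking the archive again."""
    offset, size = index[name]
    return read_range(offset, size) if size else b""
```

The index is a plain dict of names to (offset, size) pairs, so it could be serialized (for example as JSON next to the tar in the bucket) and reused; after the first pass, each member costs a single ranged request.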

Side note: couldn't you just run gunzip on the tar.gz to get a plain tar to test with?

How do I list the files in a tar on AWS S3 without downloading it?
