
Question

I have a .tgz file that is 2 GB in size.

I want to extract just one .txt file, 2 KB in size, from that .tgz file.

I have the following code:

import tarfile
from contextlib import closing

with closing(tarfile.open("myfile.tgz")) as tar:
    subdir_and_files = [
        tarinfo for tarinfo in tar.getmembers()
        if tarinfo.name.startswith("myfile/first/second/text.txt")
        ]
    print(subdir_and_files)
    tar.extractall(members=subdir_and_files)

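The problem is that it takes at least a minute before I get the extracted file. It seems that extractall extracts all the files but only saves the one I asked for.

Is there a more efficient way to achieve this?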

Answer 1

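No.

The tar format is not well suited to quickly extracting single files. In most cases the situation gets worse because the tar file is usually wrapped in a compressed stream. I'd suggest 7z.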
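To make the point about the format concrete, here is a minimal sketch that builds a tiny uncompressed archive in memory and prints where each member sits. The file names and contents are made up for illustration. Each member is a header followed by its data, one after another, with no central index to jump to.

```python
import io
import tarfile

# Build a small uncompressed tar archive in memory (hypothetical file names).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in ("a.txt", "b.txt", "c.txt"):
        data = (name * 100).encode()
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Read it back: every member is a header followed by its data,
# stored sequentially, so finding one means walking past the others.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    layout = [(m.name, m.offset, m.offset_data, m.size) for m in tar]

for name, header_at, data_at, size in layout:
    print(f"{name}: header at byte {header_at}, data at byte {data_at}, {size} bytes")
```

In an uncompressed archive a reader can at least seek past each member's data. Once the archive is gzip- or xz-compressed, everything before the wanted member has to be decompressed first.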

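Yes, kind of.

If you know there is only one file with that name, or if you only want one file, you can abort the extraction process after the first hit.

E.g.

Full scan of the archive: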

$ time tar tf /var/log/apache2/old/2016.tar.xz 
2016/
2016/access.log-20161023
2016/access.log-20160724
2016/ssl_access.log-20160711
2016/error.log-20160815
(...)
2016/error.log-20160918
2016/ssl_request.log-20160814
2016/access.log-20161017
2016/access.log-20160516
time: Real 0m1.5s  User 0m1.4s  System 0m0.2s

Scan of the archive from memory (second run, listing discarded):

$ time tar tf /var/log/apache2/old/2016.tar.xz  > /dev/null 
time: Real 0m1.3s  User 0m1.2s  System 0m0.2s

Abort after the first file:

$ time tar tf /var/log/apache2/old/2016.tar.xz  | head -n1 
2016/
time: Real 0m0.0s  User 0m0.0s  System 0m0.0s

Abort after three files:

$ time tar tf /var/log/apache2/old/2016.tar.xz  | head -n3 
2016/
2016/access.log-20161023
2016/access.log-20160724
time: Real 0m0.0s  User 0m0.0s  System 0m0.0s

Abort after a certain file in the "middle":

$ time tar xf /var/log/apache2/old/2016.tar.xz  2016/access.log-20160724  | head -n1 
time: Real 0m0.9s  User 0m0.9s  System 0m0.1s

Abort after a certain file at the "bottom":

$ time tar xf /var/log/apache2/old/2016.tar.xz  2016/access.log-20160516  | head -n1 
time: Real 0m1.1s  User 0m1.1s  System 0m0.2s

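What I'm showing you here is that if you close GNU tar's output pipe (standard output) by exiting after the first line (head -n1), the tar process dies as well.

As you can see, reading the whole archive takes more time than aborting after a file near the "bottom" of the archive. You can also see that aborting the read after hitting a file at the top takes far less time.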
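The same effect can be reproduced from Python without relying on timings. The sketch below is an illustration under stated assumptions: the member names, the payload size and the CountingBytesIO helper are all hypothetical. It counts how many compressed bytes tarfile pulls from a gzip-compressed archive before it reaches a member at the top versus one at the bottom.

```python
import io
import os
import tarfile


class CountingBytesIO(io.BytesIO):
    """BytesIO that records how many bytes have been read from it."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive(names, payload_size=16 * 1024):
    """Return the bytes of a gzip-compressed tar with one member per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in names:
            data = os.urandom(payload_size)  # random data barely compresses
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def compressed_bytes_read(archive_bytes, target):
    """Walk the archive until target is found; return compressed bytes consumed."""
    src = CountingBytesIO(archive_bytes)
    with tarfile.open(fileobj=src, mode="r:gz") as tar:
        for member in tar:
            if member.name == target:
                break  # stop at the first hit, like the head -n1 runs above
    return src.bytes_read


names = [f"2016/log-{i:03d}" for i in range(100)]  # hypothetical member names
archive = build_archive(names)
top = compressed_bytes_read(archive, names[0])
bottom = compressed_bytes_read(archive, names[-1])
print(f"archive: {len(archive)} bytes, top: {top} bytes, bottom: {bottom} bytes")
```

The exact numbers depend on the buffer sizes of the Python version in use, but reaching the last member should consume most of the archive, while reaching the first one should consume only a small part of it.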

I wouldn't do it this way if I could decide the format of the archive.

Soo ......

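Instead of the list comprehension thing that Python people love so much, iterate over tar.getmembers() (or whatever gives you one file at a time in that library) and stop when you hit the result you want, rather than expanding all the files into a list.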
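A minimal sketch of that idea follows. The function name extract_first_match and its parameters are made up for illustration. Note that tar.getmembers() itself scans the whole archive and returns a full list, so the sketch iterates over the TarFile object instead, which yields one member at a time. It assumes the wanted member is a regular file and that the first match is the one you want.

```python
import shutil
import tarfile


def extract_first_match(archive_path, member_name, dest_path):
    """Copy the first regular file named member_name to dest_path.

    Returns True if the member was found and written, False otherwise.
    Reading stops as soon as the member has been copied.
    """
    with tarfile.open(archive_path) as tar:
        # Iterating the TarFile yields members lazily, in archive order.
        for member in tar:
            if member.name == member_name and member.isfile():
                src = tar.extractfile(member)
                with src, open(dest_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return True  # first hit: do not read the rest of the archive
    return False
```

With the paths from the question, a call would look like extract_first_match("myfile.tgz", "myfile/first/second/text.txt", "text.txt"). How much time this saves depends on where the file sits in the archive: a member near the end still requires decompressing almost everything before it.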
