我将索引存储在磁盘上的压缩zip中,并希望从此zip文件中提取单个文件。在python中执行此操作似乎非常慢,是否可以解决此问题。
with zipfile.ZipFile("testoutput/index_doc.zip", mode='r') as myzip:
with myzip.open("c0ibtxf_i.txt") as mytxt:
txt = mytxt.read()
txt = codecs.decode(txt, "utf-8")
print(txt)
我使用的是python代码。在python中运行此脚本需要很长时间
python3 testunzip.py 1.22s user 0.06s system 98% cpu 1.303 total
这很烦人,特别是因为我知道它可以更快:
unzip -p testoutput/index_doc.zip c0ibtxf_i.txt 0.01s user 0.00s system 69% cpu 0.023 total
根据要求:分析
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.051 0.051 1.492 1.492 <string>:1(<module>)
127740 0.043 0.000 0.092 0.000 cp437.py:14(decode)
1 0.000 0.000 1.441 1.441 testunzip.py:69(toprofile)
1 0.000 0.000 0.000 0.000 threading.py:72(RLock)
1 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 zipfile.py:1065(__enter__)
1 0.000 0.000 0.000 0.000 zipfile.py:1068(__exit__)
1 0.692 0.692 1.441 1.441 zipfile.py:1085(_RealGetContents)
1 0.000 0.000 0.000 0.000 zipfile.py:1194(getinfo)
1 0.000 0.000 0.000 0.000 zipfile.py:1235(open)
1 0.000 0.000 0.000 0.000 zipfile.py:1591(__del__)
2 0.000 0.000 0.000 0.000 zipfile.py:1595(close)
2 0.000 0.000 0.000 0.000 zipfile.py:1713(_fpclose)
1 0.000 0.000 0.000 0.000 zipfile.py:191(_EndRecData64)
1 0.000 0.000 0.000 0.000 zipfile.py:234(_EndRecData)
127739 0.180 0.000 0.220 0.000 zipfile.py:320(__init__)
127739 0.046 0.000 0.056 0.000 zipfile.py:436(_decodeExtra)
1 0.000 0.000 0.000 0.000 zipfile.py:605(_check_compression)
1 0.000 0.000 0.000 0.000 zipfile.py:636(_get_decompressor)
1 0.000 0.000 0.000 0.000 zipfile.py:654(__init__)
3 0.000 0.000 0.000 0.000 zipfile.py:660(read)
1 0.000 0.000 0.000 0.000 zipfile.py:667(close)
1 0.000 0.000 0.000 0.000 zipfile.py:708(__init__)
1 0.000 0.000 0.000 0.000 zipfile.py:821(read)
1 0.000 0.000 0.000 0.000 zipfile.py:854(_update_crc)
1 0.000 0.000 0.000 0.000 zipfile.py:901(_read1)
1 0.000 0.000 0.000 0.000 zipfile.py:937(_read2)
1 0.000 0.000 0.000 0.000 zipfile.py:953(close)
1 0.000 0.000 1.441 1.441 zipfile.py:981(__init__)
127740 0.049 0.000 0.049 0.000 {built-in method _codecs.charmap_decode}
1 0.000 0.000 0.000 0.000 {built-in method _codecs.decode}
1 0.000 0.000 0.000 0.000 {built-in method _codecs.utf_8_decode}
127743 0.058 0.000 0.058 0.000 {built-in method _struct.unpack}
127739 0.016 0.000 0.016 0.000 {built-in method builtins.chr}
1 0.000 0.000 1.492 1.492 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
2 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
255484 0.020 0.000 0.020 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {built-in method builtins.max}
1 0.000 0.000 0.000 0.000 {built-in method builtins.min}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
1 0.000 0.000 0.000 0.000 {built-in method io.open}
2 0.000 0.000 0.000 0.000 {built-in method zlib.crc32}
1 0.000 0.000 0.000 0.000 {function ZipExtFile.close at 0x101975620}
127741 0.011 0.000 0.011 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of '_io.BufferedReader' objects}
127740 0.224 0.000 0.317 0.000 {method 'decode' of 'bytes' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
127739 0.024 0.000 0.024 0.000 {method 'find' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
7 0.006 0.001 0.006 0.001 {method 'read' of '_io.BufferedReader' objects}
510956 0.071 0.000 0.071 0.000 {method 'read' of '_io.BytesIO' objects}
8 0.000 0.000 0.000 0.000 {method 'seek' of '_io.BufferedReader' objects}
4 0.000 0.000 0.000 0.000 {method 'tell' of '_io.BufferedReader' objects}
它似乎是在构造函数中发生的事情?我可以以某种方式避免这种开销吗?
答案 0 :(得分:6)
我弄清楚问题是什么:
为了解决这个问题,我改编了python的zipfile源代码。它具有您需要的所有默认功能,但是当您为构造函数提供要提取的文件名列表时,它不会构建整个信息列表。
在特定用例中,您只需要zip中的一些文件,这会对性能和内存使用产生很大影响。
对于上面示例中的特定情况(即从包含128K文件的zip中仅提取一个文件,新实现的速度现在接近解压缩方法的速度)
测试用例:
def original_zipfile():
import zipfile
with zipfile.ZipFile("testoutput/index_doc.zip", mode='r') as myzip:
with myzip.open("c6kn5pu_i.txt") as mytxt:
txt = mytxt.read()
def my_zipfile():
import zipfile2
with zipfile2.ZipFile("testoutput/index_doc.zip", to_extract=["c6kn5pu_i.txt"], mode='r') as myzip:
with myzip.open("c6kn5pu_i.txt") as mytxt:
txt = mytxt.read()
if __name__ == "__main__":
import time
time1 = time.time()
original_zipfile()
print("running time of original_zipfile = "+str(time.time()-time1))
time1 = time.time()
my_zipfile()
print("running time of my_new_zipfile = "+str(time.time()-time1))
print(myStopwatch.getPretty())
导致以下时间读数
running time of original_zipfile = 1.0871901512145996
running time of my_new_zipfile = 0.07036209106445312
我将包含源代码,但请注意我的实现有两个小缺陷(一旦你提供了一个提取列表,当你不提供时,行为将与前面提到的相同):