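I wrote the following script to concatenate all the files in a directory into a single file.
Can this be optimized in terms of
idiomatic Python
time?
Here is the snippet: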
import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"
filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")
Answer 0 (score: 30)
Use shutil.copyfileobj to copy the data:
import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)
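shutil reads from the readfile object in chunks and writes them directly to the outfile file object. Do not use readline() or an iteration buffer, since you do not need the overhead of finding line endings.
Use the same mode for both reading and writing; this is especially important when using Python 3. I have used binary mode for both here.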
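As a self-contained sketch of the approach in this answer, the copy loop can be wrapped in a function. The name `concatenate`, its parameters, the sorted file order, and the `os.path.abspath` comparison are assumptions made for illustration; they are not part of the original answer.

```python
import glob
import os
import shutil


def concatenate(pattern, outfilename):
    """Concatenate every file matching `pattern` into `outfilename`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    out_path = os.path.abspath(outfilename)
    # Sort for a deterministic order; glob.glob() makes no ordering promise.
    filenames = sorted(glob.glob(pattern))
    with open(outfilename, 'wb') as outfile:
        for filename in filenames:
            if os.path.abspath(filename) == out_path:
                # Skip the output file itself if the pattern matches it.
                continue
            with open(filename, 'rb') as readfile:
                # Copied chunk by chunk, so a large input is never
                # held in memory all at once.
                shutil.copyfileobj(readfile, outfile)
```

It could be called as `concatenate('*.txt', outfilename)`. Both files are opened in binary mode, so the bytes are copied unchanged.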
Answer 1 (score: 2)
Using Python 2.7, I did some "benchmark" testing of
outfile.write(infile.read())
VS
shutil.copyfileobj(readfile, outfile)
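I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined file size of about 2.6 GB. With both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.
When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: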
outfile.write, binary mode: 43 seconds, on average.
shutil.copyfileobj, normal mode: 27 seconds, on average.
In normal read mode the final size of the outfile was 2620 MB, whereas in binary read mode it was 2578 MB.
Answer 2 (score: 1)
There is no need to use that many variables.
with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")
Answer 3 (score: 1)
The fileinput module provides a natural way to iterate over multiple files:
for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
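The snippet above assumes that `fileinput` and `glob` are imported and that `outfile` is already open. A fuller sketch might look like the following. The function name `concatenate_lines`, its parameters, and the empty-list guard are assumptions added for illustration.

```python
import fileinput
import glob
import os


def concatenate_lines(pattern, outfilename):
    """Copy all files matching `pattern` into `outfilename`, line by line.

    Hypothetical helper: the name and signature are illustrative only.
    """
    out_path = os.path.abspath(outfilename)
    filenames = sorted(
        name for name in glob.glob(pattern)
        if os.path.abspath(name) != out_path  # never read the output file
    )
    with open(outfilename, 'w') as outfile:
        if not filenames:
            # With no files given, fileinput falls back to sys.argv[1:]
            # and then stdin, so stop here and leave the output empty.
            return
        with fileinput.input(files=filenames) as lines:
            for line in lines:
                outfile.write(line)
```

This reads and writes in text mode, one line at a time, so it will likely be slower than a chunked binary copy on large files.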
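Answer 4 (score: 1)
I was curious to check performance further, so I used the answers of Martijn Pieters and Stephen Miller.
I tried binary and text modes, both with shutil and without shutil. I tried to merge 270 files.
Text mode -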
def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())
Binary mode -
# Renamed from *_text so these do not shadow the text-mode functions.
def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())
Running times for binary mode -
Shutil - 20.161773920059204
Normal - 17.327500820159912
Running times for text mode -
Shutil - 20.47757601737976
Normal - 13.718038082122803
It looks like shutil performs about the same in both modes, while without shutil, text mode is faster than binary mode.
OS: Mac OS 10.14 Mojave. MacBook Air 2017.
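The answer does not show how the running times were measured. One possible way to time such functions is sketched below. The helper `time_call` and its `repeat` parameter are hypothetical and are not taken from the answer.

```python
import time


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` calls.

    Hypothetical helper, not the method used in the answer above.
    """
    best = float('inf')
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `time_call(using_shutil_text, 'all_text.txt')` would time the text-mode shutil variant. Repeated runs are likely to be influenced by the operating system's file cache, so numbers from such a harness are only a rough comparison.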
Answer 5 (score: 0)
You can iterate over the lines of a file object directly, without reading the whole content into memory:
with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)