Python script to concatenate all the files in a directory into one file

Time: 2013-07-19 15:08:06

Tags: python

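I have written the following script to concatenate all the files in a directory into one single file.

Can this be optimized in terms of

  1. idiomatic Python

  2. time

Here is the snippet: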

    import time, glob
    
    outfilename = 'all_' + str((int(time.time()))) + ".txt"
    
    filenames = glob.glob('*.txt')
    
    with open(outfilename, 'wb') as outfile:
        for fname in filenames:
            with open(fname, 'r') as readfile:
                infile = readfile.read()
                for line in infile:
                    outfile.write(line)
                outfile.write("\n\n")
    

6 answers:

Answer 0 (score: 30)

Use shutil.copyfileobj to copy the data:

import glob
import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

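shutil reads from the readfile object in chunks and writes them directly to the outfile file object. Do not use readline() or iterate over the file line by line, since you do not need the overhead of finding line endings.

Use the same mode for both reading and writing; this is especially important when using Python 3. I have used binary mode for both here.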
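The same idea can be wrapped in a self-contained function. This is only a minimal sketch and not part of the original answer: the name concat_files and its parameters are made up for illustration, and it assumes that comparing absolute paths is enough to recognise the output file among the glob matches.

```python
import glob
import os
import shutil


def concat_files(pattern, outfilename):
    """Concatenate every file matching `pattern` into `outfilename`.

    Hypothetical helper for illustration; the name and signature are
    assumptions, not something the original answer defines.
    """
    out_path = os.path.abspath(outfilename)
    # Sort the matches so the order is deterministic (glob makes no
    # ordering guarantee) and skip the output file itself.
    inputs = sorted(
        name for name in glob.glob(pattern)
        if os.path.abspath(name) != out_path
    )
    with open(outfilename, 'wb') as outfile:
        for name in inputs:
            with open(name, 'rb') as readfile:
                # Copies in chunks; a whole input file is never held in memory.
                shutil.copyfileobj(readfile, outfile)
    return inputs
```

Calling concat_files('*.txt', outfilename) would then reproduce the loop above, with the list of copied files returned for logging.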

Answer 1 (score: 2)

Using Python 2.7, I did some "benchmark" testing of

outfile.write(infile.read())

vs.

shutil.copyfileobj(readfile, outfile)

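I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of about 2.6 GB. With both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: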

outfile.write, binary mode: 43 seconds, on average.

shutil.copyfileobj, normal mode: 27 seconds, on average.

The outfile had a final size of 2620 MB in normal read mode, versus 2578 MB in binary read mode.
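The answer does not show how the timings were taken. One way to run a similar comparison yourself is sketched below. The helper names (copy_with_read, copy_with_copyfileobj, time_copy) are hypothetical, the code assumes Python 3 (time.perf_counter is not available in Python 2.7), and the results will depend heavily on the disk, the OS file cache, and the Python version.

```python
import shutil
import time


def copy_with_read(src, dst, read_mode='rb', write_mode='wb'):
    """Read the whole source into memory, then write it out in one call."""
    with open(src, read_mode) as readfile, open(dst, write_mode) as outfile:
        outfile.write(readfile.read())


def copy_with_copyfileobj(src, dst, read_mode='rb', write_mode='wb'):
    """Stream the source to the destination in chunks."""
    with open(src, read_mode) as readfile, open(dst, write_mode) as outfile:
        shutil.copyfileobj(readfile, outfile)


def time_copy(copy_func, src, dst, repeats=3):
    """Return the average wall-clock time in seconds over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        copy_func(src, dst)
    return (time.perf_counter() - start) / repeats
```

To time the "normal" (text) mode, the same helpers can be called with read_mode='r' and write_mode='w', for example through a small lambda passed to time_copy.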

Answer 2 (score: 1)

There is no need to use that many variables.

with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")

Answer 3 (score: 1)

The fileinput module provides a natural way to iterate over multiple files:

for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
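The snippet above leaves out the imports and the code that opens outfile. A fuller version might look like the sketch below; the function name concat_with_fileinput is hypothetical, and it assumes Python 3.2 or later, where fileinput.input can be used as a context manager. One detail worth guarding against: if the list of files is empty, fileinput falls back to reading standard input.

```python
import fileinput
import glob
import os


def concat_with_fileinput(pattern, outfilename):
    """Write every line of every file matching `pattern` to `outfilename`.

    Hypothetical helper for illustration only.
    """
    out_path = os.path.abspath(outfilename)
    filenames = sorted(
        name for name in glob.glob(pattern)
        if os.path.abspath(name) != out_path
    )
    with open(outfilename, 'w') as outfile:
        # With an empty list, fileinput would read from stdin instead,
        # so only start iterating when there is something to read.
        if filenames:
            with fileinput.input(files=filenames) as lines:
                for line in lines:
                    outfile.write(line)
    return filenames
```

Because this iterates line by line in text mode, it pays the line-splitting overhead that the copyfileobj approach avoids.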

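Answer 4 (score: 1)

I was curious to check the performance further, so I used the answers from Martijn Pieters and Stephen Miller.

I tried binary and text modes, with and without shutil. I merged 270 files.

Text mode -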

def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())

Binary mode -

def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())

Running times for binary mode -

Shutil - 20.161773920059204
Normal - 17.327500820159912

Running times for text mode -

Shutil - 20.47757601737976
Normal - 13.718038082122803

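It looks like shutil performs about the same in both modes, while the plain read/write approach is faster in text mode than in binary mode.

OS: Mac OS 10.14 Mojave. MacBook Air 2017.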

Answer 5 (score: 0)

You can iterate directly over the lines of a file object, without reading the whole thing into memory:

with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
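Combined with the outer loop and the blank-line separator from the question, the line-by-line approach could be sketched as follows. The function name concat_by_line and the separator parameter are illustrative assumptions, not something the answer specifies.

```python
def concat_by_line(filenames, outfilename, separator="\n\n"):
    """Copy each input file line by line, writing `separator` after each one.

    Hypothetical helper for illustration only.
    """
    with open(outfilename, 'w') as outfile:
        for fname in filenames:
            with open(fname, 'r') as readfile:
                # Iterating over the file object yields one line at a time,
                # so only a single line is held in memory.
                for line in readfile:
                    outfile.write(line)
            outfile.write(separator)
```

The caller is expected to pass a list of input names that does not include the output file.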