如何仅合并从file_a到file_b的唯一行?

时间:2011-03-23 15:01:13

标签: python python-2.3

这个问题已经以这种或那种形式被问到,但不是我正在寻找的东西。所以,这就是我将要遇到的情况:我已经有一个名为file_a的文件,我正在创建另一个文件 - file_b。 file_a的大小总是大于file_b。 file_b中会有许多重复的行(因此,在file_a中也是如此),但这两个文件都有一些独特的行。我想要做的是:仅复制/合并从file_afile_b的唯一行,然后对行顺序进行排序,以便file_b成为最新的行与所有独特的条目。两个原始文件的大小不应超过10MB。我能做到的最有效(也是最快)的方法是什么?

我在考虑这样的事情,合并就好了。

#!/usr/bin/env python

import os, time, sys

# Convert Date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file,c_file]

m_file = "merged.file"

for x in range(len(n_file)):
    P = open(n_file[x],"r")
    output = P.readlines()
    P.close()

    # Sort the output, order by 2nd last field
    #sp_lines = [ line.split('\t') for line in output ]
    #sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

    F = open(m_file,'w') 
    #for line in sp_lines:
    for line in output:
        if "group_" in line:
            F.write(line)
    F.close()

但是,它是:

  • 不仅有唯一的行
  • 未排序(由最后一个字段旁边)
  • 并介绍第3个文件,即m_file

只是旁注(长话短说):我不能在这里使用sorted(),因为我正在使用v2.3,不幸的是。输入文件如下所示:

On 23/03/11 00:40:03
JobID   Group.User          Ctime   Wtime   Status  QDate               CDate
===================================================================================
430792  group_atlas.pltatl16    0   32  4   02/03/11 21:52:38   02/03/11 22:02:15
430793  group_atlas.atlas084    30  472 4   02/03/11 21:57:43   02/03/11 22:09:35
430794  group_atlas.atlas084    12  181 4   02/03/11 22:02:37   02/03/11 22:05:42
430796  group_atlas.atlas084    8   185 4   02/03/11 22:02:38   02/03/11 22:05:46

我尝试使用cmp()按第二个字段排序,但是,我认为它只是因为输入文件的前三行而无效。

有人可以帮忙吗?干杯!!!

<小时/> 更新1:

对于将来的参考,正如Jakob所建议的,这是完整的脚本。它运作得很好。

#!/usr/bin/env python

import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    #I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    #
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

print time.strftime('%H:%M:%S', time.localtime())

# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

F = open(m_file,'w')
print "No. of lines: ",len(sp_lines)

for line in sp_lines:

    MF = '\t'.join(line)
    F.write(MF)
F.close()

花了大约2米:47秒来完成145244行。

[testac1@serv07 ~]$ ./uniq-merge.py 
17:19:21
No. of lines:  145244
17:22:08

谢谢!

<小时/> 更新2:

嗨eyquem,这是我运行脚本时收到的错误消息。

来自第一个脚本:

[testac1@serv07 ~]$ ./uniq-merge_2.py 
  File "./uniq-merge_2.py", line 44
    fm.writelines( '\n'.join(v)+'\n' for k,v in output )
                                       ^
SyntaxError: invalid syntax

来自第二个脚本:

[testac1@serv07 ~]$ ./uniq-merge_3.py 
  File "./uniq-merge_3.py", line 24
    output = sett(line.rstrip() for line in fa)
                                  ^
SyntaxError: invalid syntax

干杯!!

<小时/> 更新3:

前一个根本没有对列表进行排序。感谢eyquem指出这一点。嗯,现在就做了。这是对Jakob版本的进一步修改 - 我将set:app(path1,path2)转换为list:myList(),然后将sort(lambda ...)应用于myList以对合并文件进行排序由巢到最后的领域。这是最后的剧本。

#!/usr/bin/env python

import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()

    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())

# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

# Convert set into to list
myList = list(app(o_file, c_file))

# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

F = open(m_file,'w')
print "No. of lines: ",len(sp_lines)

# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()

根本没有速度提升,处理相同的145244线需要2m:50s。有人看到任何改进的范围,请告诉我。感谢Jakob和eyquem的时间。干杯!!

<小时/> 更新4:

仅供将来参考,这是 eyguem 的修改版本,其效果比以前更好,更快。

#!/usr/bin/env python

import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file ):

    # RegEx for Date/time filed
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    def kl(lines,pat = pat):
        # match only the next to last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    # Separate the header & remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1):  break
            else:  head.append(line1) # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()

    fa = open(o_file, 'r')
    rmHead(fa)

    fb = open(c_file, 'r')
    rmHead(fb)

    # Sorting date-wise
    output = [ (kl(line),line.rstrip()) for line in output if line.rstrip() ]
    output.sort()

    fm = open(m_file,'w')
    # Write to the file & add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head[0]+head[1])))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()


c_f = "03_a"
o_f = "03_b"

sorting_merge(o_f, c_f, 'outfile.txt')

这个版本要快得多 - 6.99秒。对于145244行与2m:47s相比 - 然后是前一个使用lambda a, b: cmp()。感谢eyquem的所有支持。干杯!!

4 个答案:

答案 0 :(得分:2)

也许是这些方面的东西?

from sets import Set as set

def yield_lines(fileobj):
    #I want to discard the headers
    for i in xrange(3):
        fileobj.readline()

    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))

    return file1.union(file2)

编辑:忘记:$

答案 1 :(得分:2)

编辑2

我之前的代码在output = sett(line.rstrip() for line in fa)output.sort(key=kl)

方面存在问题

此外,他们有一些并发症。

所以我检查了使用Jakob Bowyer在他的代码中采用的set()函数直接读取文件的选择。

祝贺雅各布!(顺便说一下,Michal Chruszcz):set()是无与伦比的,它比一次读取一行更快。

然后,我放弃了我的想法,一行一行地读取文件。

但我保留了我的想法,以避免在cmp()函数的帮助下进行排序,因为正如文档中所描述的那样:

  

s.sort([cmpfunc =无])

     

sort()方法采用可选方法   指定比较的参数   两个参数的功能(列表项)   (...)请注意,会减慢排序速度   过程大幅下降

     

http://docs.python.org/release/2.3/lib/typesseq-mutable.html

然后,我设法获得(t,line)的元组列表,其中 t

time.mktime(time.strptime(( 1st date-and-hour in line ,'%d/%m/%y %H:%M:%S'))

通过指令

output = [ (kl(line),line.rstrip()) for line in output]

我测试了2个代码。由于正则表达式,计算了以下第一个日期和时间在行

def kl(line,pat = pat):
    return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

output = [ (kl(line),line.rstrip()) for line in output if line.rstrip()]

output.sort()

第二个代码kl()是:

def kl(line,pat = pat):
    return time.mktime(time.strptime(line.split('\t')[-2],'%d/%m/%y %H:%M:%S'))

结果

  

执行时间:

     带有正则表达式的第一个代码

0.03598秒

     对于带有split('\ t')

的第二个代码,

0.03580秒      

也就是说相同

此算法比使用函数cmp()

的代码更快

一行代码,其中输出的行集未通过

转换为元组列表
output = [ (kl(line),line.rstrip()) for line in output]

但只是在一个行列表中转换(然后没有重复)并使用函数 mycmp()进行排序(参见文档):

def mycmp(a,b):
    return cmp(time.mktime(time.strptime(a.split('\t')[-2],'%d/%m/%y %H:%M:%S')),
               time.mktime(time.strptime(b.split('\t')[-2],'%d/%m/%y %H:%M:%S')))

output = [ line.rstrip() for line in output] # not list(output) , to avoid the problem of newline of the last line of each file
output.sort(mycmp)

for line in output:
    fm.write(line+'\n')

的执行时间为

  

0.11574秒

代码:

#!/usr/bin/env python

import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file , c_file, m_file ):

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)') 

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    fa = open(o_file)
    fa.readline() # first line is skipped
    while True:
        line1 = fa.readline()
        mat1  = pat.search(line1)
        if not mat1: head.append(line1) # line1 is here a line of the header
        else: break # the loop ends on the first line1 not being a line of the heading
    output = sett( fa )
    fa.close()

    fb = open(c_file)
    while True:
        line1 = fb.readline()
        if pat.search(line1):  break
    output = output.union(sett( fb ))
    fb.close()

    output = [ (kl(line),line.rstrip()) for line in output]
    output.sort()

    fm = open(m_file,'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()


te = time.clock()
sorting_merge('ytre.txt','tataye.txt','merged.file.txt')
print time.clock()-te

这一次,我希望它能够正常运行,并且唯一要做的就是等待真实文件的执行次数比我测试代码的文件大得多

编辑3

pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '(?=[ \t]+'
                 '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '|'
                 '[ \t]+aborted/deleted)')

编辑4

#!/usr/bin/env python

import os, time, sys, re
from sets import Set

def sorting_merge(o_file , c_file, m_file ):

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+'
                     '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '|'
                     '[ \t]+aborted/deleted)')

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    head = []
    output = Set()

    fa = open(o_file)
    fa.readline() # first line is skipped
    for line1 in fa:
        if pat.search(line1):  break # first line after the heading
        else:  head.append(line1) # line of the header
    for line in fa:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fa.close()

    fb = open(c_file)
    for line1 in fb:
        if pat.search(line1):  break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()

    if '' in output:  output.remove('')
    output = [ (kl(line),line) for line in output]
    output.sort()

    fm = open(m_file,'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line+'\n')
    fm.close()

te = time.clock()
sorting_merge('A.txt','B.txt','C.txt')
print time.clock()-te

答案 2 :(得分:0)

我编写了这个新代码,易于使用集合。它比我之前的代码更快。而且,似乎比你的代码

#!/usr/bin/env python

import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file , c_file, m_file ):

    # Convert Date/time to epoch
    def toEpoch(dt):
        dt_ptrn = '%d/%m/%y %H:%M:%S'
        return int(time.mktime(time.strptime(dt, dt_ptrn)))

    pat = re.compile('([0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)'
                     '[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d') 

    fa = open(o_file)
    head = []
    fa.readline()
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1:
            head.append(('',line1.rstrip()))
        else:
            break
    output = sett((toEpoch(pat.search(line).group(1)) , line.rstrip())
                 for line in fa)
    output.add((toEpoch(mat1.group(1)) , line1.rstrip()))
    fa.close()


    fb = open(c_file)
    while True:
        line1 = fb.readline()
        mat1 = pat.search(line1)
        if mat1:  break
    for line in fb:
        output.add((toEpoch(pat.search(line).group(1)) , line.rstrip()))
    output.add((toEpoch(mat1.group(1)) , line1.rstrip()))
    fb.close()

    output = list(output)
    output.sort()
    output[0:0] = head
    output[0:0] = [('',time.strftime('On %d/%m/%y %H:%M:%S'))]

    fm = open(m_file,'w')
    fm.writelines( line+'\n' for t,line in output)
    fm.close()



te = time.clock()
sorting_merge('ytr.txt','tatay.txt','merged.file.txt')
print time.clock()-te

请注意,此代码将标题放在合并文件中

修改

Aaaaaah ......我明白了......: - ))

执行时间除以3!

#!/usr/bin/env python

import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file , c_file, m_file ):

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)') 

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    fa = open(o_file)
    head = []
    fa.readline()
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1:
            head.append(line1.rstrip())
        else:
            break
    output = sett(line.rstrip() for line in fa)
    output.add(line1.rstrip())
    fa.close()

    fb = open(c_file)
    while True:
        line1 = fb.readline()
        mat1 = pat.search(line1)
        if mat1:  break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()

    output = list(output)
    output.sort(key=kl)
    output[0:0] = [time.strftime('On %d/%m/%y %H:%M:%S')] + head

    fm = open(m_file,'w')
    fm.writelines( line+'\n' for line in output)
    fm.close()

te = time.clock()
sorting_merge('ytre.txt','tataye.txt','merged.file.txt')
print time.clock()-te

答案 3 :(得分:0)

我希望最后的代码。

因为我找到了一个杀手级代码。

首先,我创建了两个文件“xxA.txt”和“yyB.txt”,共有30行,有30000行

430559  group_atlas.atlas084    12  181 4       04/03/10 01:38:02   02/03/11 22:05:42
430502  group_atlas.atlas084    12  181 4       23/01/10 21:45:05   02/03/11 22:05:42
430544  group_atlas.atlas084    12  181 4       17/06/11 12:58:10   02/03/11 22:05:42
430566  group_atlas.atlas084    12  181 4       25/03/10 23:55:22   02/03/11 22:05:42

使用以下代码:

创建AB.py

from random import choice

n = tuple( str(x) for x in xrange(500,600))
days = ('01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16',
        '17','18','19','20','21','22','23','24','25','26','27','28')
# not '29','30,'31' to avoid problems with strptime() on last days of february
months = days[0:12]
hours = days[0:23]
ms = ['00','01','02','03','04','05','06','07','09'] + [str(x) for x in xrange(10,60)]

repeat = 30000

with open('xxA.txt','w') as f:
    # 430794  group_atlas.atlas084    12  181 4     02/03/11 22:02:37   02/03/11 22:05:42
    ch = ('On 23/03/11 00:40:03\n'
          'JobID   Group.User          Ctime   Wtime   Status  QDate               CDate\n'
          '===================================================================================\n')
    f.write(ch)
    for i in xrange(repeat):
        line  = '430%s  group_atlas.atlas084    12  181 4   \t%s/%s/%s %s:%s:%s\t02/03/11 22:05:42\n' %\
                (choice(n),
                 choice(days),choice(months),choice(('10','11')),
                 choice(hours),choice(ms),choice(ms))
        f.write(line)


with open('yyB.txt','w') as f:
    # 430794  group_atlas.atlas084    12  181 4     02/03/11 22:02:37   02/03/11 22:05:42
    ch = ('On 25/03/11 13:45:24\n'
          'JobID   Group.User          Ctime   Wtime   Status  QDate               CDate\n'
          '===================================================================================\n')
    f.write(ch)
    for i in xrange(repeat):
        line  = '430%s  group_atlas.atlas084    12  181 4   \t%s/%s/%s %s:%s:%s\t02/03/11 22:05:42\n' %\
                (choice(n),
                 choice(days),choice(months),choice(('10','11')),
                 choice(hours),choice(ms),choice(ms))
        f.write(line)

with open('xxA.txt') as g:
    print 'readlines of xxA.txt :',len(g.readlines())
    g.seek(0,0)
    print 'set of xxA.txt :',len(set(g))

with open('yyB.txt') as g:
    print 'readlines of yyB.txt :',len(g.readlines())
    g.seek(0,0)
    print 'set of yyB.txt :',len(set(g))

然后我运行了这3个程序:

“合并regex.py”

#!/usr/bin/env python

from time import clock,mktime,strptime,strftime
from sets import Set
import re

infunc = []

def sorting_merge(o_file, c_file, m_file ):
    infunc.append(clock()) #infunc[0]
    pat = re.compile('([0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)')
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='':  break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output:  output.remove('')

    infunc.append(clock()) #infunc[4]
    output = [ (mktime(strptime(pat.search(line).group(),'%d/%m/%y %H:%M:%S')),line)
               for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]

    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]



c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergedr.txt')
t2 = clock()
print 'merging regex'
print 'total time of execution :',t2-t1
print '              launching :',infunc[1] - t1
print '            preparation :',infunc[1] - infunc[0]
print '    reading of 1st file :',infunc[2] - infunc[1]
print '    reading of 2nd file :',infunc[3] - infunc[2]
print '      output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print '      sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]

“合并split.py”

#!/usr/bin/env python

from time import clock,mktime,strptime,strftime
from sets import Set

infunc = []

def sorting_merge(o_file, c_file, m_file ):
    infunc.append(clock()) #infunc[0]
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='':  break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output:  output.remove('')

    infunc.append(clock()) #infunc[4]
    output = [ (mktime(strptime(line.split('\t')[-2],'%d/%m/%y %H:%M:%S')),line)
               for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]

    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]



c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergeds.txt')
t2 = clock()
print 'merging split'
print 'total time of execution :',t2-t1
print '              launching :',infunc[1] - t1
print '            preparation :',infunc[1] - infunc[0]
print '    reading of 1st file :',infunc[2] - infunc[1]
print '    reading of 2nd file :',infunc[3] - infunc[2]
print '      output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print '      sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]

“合并杀手”

#!/usr/bin/env python

from time import clock,strftime
from sets import Set
import re

infunc = []

def sorting_merge(o_file, c_file, m_file ):
    infunc.append(clock()) #infunc[0]
    patk = re.compile('([0123]\d)/([01]\d)/(\d{2}) ([012]\d:[0-6]\d:[0-6]\d)')
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='':  break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output:  output.remove('')

    infunc.append(clock()) #infunc[4]
    output = [ (patk.search(line).group(3,2,1,4),line)for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]

    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]



c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergedk.txt')
t2 = clock()
print 'merging killer'
print 'total time of execution :',t2-t1
print '              launching :',infunc[1] - t1
print '            preparation :',infunc[1] - infunc[0]
print '    reading of 1st file :',infunc[2] - infunc[1]
print '    reading of 2nd file :',infunc[3] - infunc[2]
print '      output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print '      sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]

结果

merging regex
total time of execution : 14.2816595405
              launching : 0.00169211450059
            preparation : 0.00168093989599
    reading of 1st file : 0.163582242995
    reading of 2nd file : 0.141301478261
      output.remove('') : 2.37460347614e-05
     creation of output : 13.4460212122
      sorting of output : 0.216363532237
writing of merging file : 0.232923737514
closing of the function : 0.0797514767938

merging split
total time of execution : 13.7824474898
              launching : 4.10666718815e-05
            preparation : 2.70984161395e-05
    reading of 1st file : 0.154349784679
    reading of 2nd file : 0.136050810927
      output.remove('') : 2.06730184981e-05
     creation of output : 12.9691854691
      sorting of output : 0.218704332534
writing of merging file : 0.225259076223
closing of the function : 0.0788362766776

merging killer
total time of execution : 2.14315311024
              launching : 0.00206199391263
            preparation : 0.00205026057781
    reading of 1st file : 0.158711791582
    reading of 2nd file : 0.138976601775
      output.remove('') : 2.37460347614e-05
     creation of output : 0.621466415424
      sorting of output : 0.823161602941
writing of merging file : 0.227701565422
closing of the function : 0.171049393149

在杀手程序中,排序输出需要4倍的时间,但作为列表创建输出的时间除以21! 然后全局,执行的时间至少减少了85%。