So right now I have hard-coded 4 if/elif/else statements. Is there a more dynamic way to do this? For example, what if I wanted to do a 10-way or even a 40-way merge?
# 4-way merge sort, sorted page files
outfile = "fullsorted.txt"
of = open(outfile, "w")
f1 = open("temp0-sorted.txt", "r")
f2 = open("temp1-sorted.txt", "r")
f3 = open("temp2-sorted.txt", "r")
f4 = open("temp3-sorted.txt", "r")
f1_line = f1.readline()
f2_line = f2.readline()
f3_line = f3.readline()
f4_line = f4.readline()
while len(f1_line) > 0 and len(f2_line) > 0 and len(f3_line) > 0 and len(f4_line) > 0:
    if f1_line < f2_line and f1_line < f3_line and f1_line < f4_line and len(f1_line) > 0:
        of.write(f1_line)
        f1_line = f1.readline()
    elif f2_line < f3_line and f2_line < f4_line and len(f2_line) > 0:
        of.write(f2_line)
        f2_line = f2.readline()
    elif f3_line < f4_line and len(f3_line) > 0:
        of.write(f3_line)
        f3_line = f3.readline()
    else:
        of.write(f4_line)
        f4_line = f4.readline()
of.close()
Answer 0 (score: 5)
Just use heapq.merge:
import heapq
#4-way merge sort, sorted page files
outfile="fullsorted.txt"
with open("temp0-sorted.txt", "r") as f1,\
     open("temp1-sorted.txt", "r") as f2,\
     open("temp2-sorted.txt", "r") as f3,\
     open("temp3-sorted.txt", "r") as f4,\
     open(outfile, "w") as of:
    of.writelines(heapq.merge(f1, f2, f3, f4))
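Since heapq.merge accepts any number of iterables, this approach extends directly to an N-way merge over a list of filenames. A minimal sketch (the merge_files name is mine, and the filename pattern just follows the question's temp*-sorted.txt convention):

```python
import heapq

def merge_files(filenames, outname):
    """Merge any number of individually sorted text files into one."""
    handles = [open(name, "r") for name in filenames]
    try:
        with open(outname, "w") as of:
            # heapq.merge lazily interleaves the already-sorted inputs,
            # so memory use stays constant regardless of file sizes.
            of.writelines(heapq.merge(*handles))
    finally:
        for h in handles:
            h.close()
```

With this, merge_files(["temp%d-sorted.txt" % i for i in range(40)], "fullsorted.txt") would do a 40-way merge with no extra code.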
Answer 1 (score: 1)
Following your own code's pattern, you can extend it into a list-based approach, like this:
outfile = "fullsorted.txt"
of = open(outfile, "w")
files = ["temp0-sorted.txt", "temp1-sorted.txt", "temp2-sorted.txt", "temp3-sorted.txt"]
filehandles = [open(f, "r") for f in files]
lines = [f.readline() for f in filehandles]
while len(filehandles) > 0:
    smallest = min(lines)
    smallestposition = lines.index(smallest)
    of.write(smallest)
    lines[smallestposition] = filehandles[smallestposition].readline()
    if lines[smallestposition] == "":
        filehandles[smallestposition].close()
        filehandles.pop(smallestposition)
        lines.pop(smallestposition)
of.close()
Note that this merges the files to completion, rather than stopping as soon as one file runs out.
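One refinement worth noting: the min()/index() scan above costs O(k) comparisons per output line when k files are open. For larger merges (the 40-way case from the question), a heap brings that down to O(log k). A sketch of that variant (the heap_merge name is mine, not from the answer above):

```python
import heapq

def heap_merge(filenames, outname):
    """K-way merge of sorted text files using an explicit heap."""
    handles = [open(name, "r") for name in filenames]
    # Each heap entry is (line, index): the index breaks ties between
    # equal lines and tells us which handle to refill from.
    heap = []
    for i, h in enumerate(handles):
        line = h.readline()
        if line:
            heap.append((line, i))
    heapq.heapify(heap)
    with open(outname, "w") as of:
        while heap:
            line, i = heapq.heappop(heap)
            of.write(line)
            nxt = handles[i].readline()
            if nxt:
                heapq.heappush(heap, (nxt, i))
    for h in handles:
        h.close()
```

This is essentially what heapq.merge does internally, written out by hand.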
Answer 2 (score: 0)
Thanks for the tips, everyone. Here is my solution:
# 'files' (the number of page files) and 'of' (the output file)
# are defined earlier in the script.
sorted_files = []
strings = []
for i in xrange(files + 1):
    sorted_files.append(open("temp" + str(i) + "-sorted.txt", "r"))
    strings.append(sorted_files[i].readline())
print len(sorted_files)
print strings
eofs = 0
while eofs != 1:
    small_str = min(filter(lambda x: x != "", strings))
    str_index = strings.index(small_str)
    of.write(small_str)
    strings[str_index] = sorted_files[str_index].readline()
    if all(i == "" for i in strings):
        eofs = 1
As a benchmark, I tested this on a file of about 6.5 million lines (~700MB): I paged it into 500,000-line files, quicksorted those files lexicographically, and then sort-merged (well, really merged) them with the code above, so about 128 files were merged in total (I also had a 2-billion-line file, but accidentally deleted it while removing the page files). It sorted the file and found the duplicates within 16 minutes:
real 15m54.375s
user 15m52.096s
sys 0m3.000s
This is my first script of this nature, and I would really appreciate feedback: is the page size reasonable, and is the sorting method I used correct? Generating and quicksorting the page files is fast, but the merge takes by far the longest.