Question

Basically I have a bunch of files containing domain names. I've sorted each file by its TLD using .sort(key=func_that_returns_tld).

Now that I've done that, I want to merge all the files and end up with one massive sorted file. I assume I need something like this:

open all files
read one line from each file into a list
sort list with .sort(key=func_that_returns_tld)
output that list to file
loop by reading next line

Am I thinking about this properly? Any advice on how to accomplish this would be appreciated.

Answer 1

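If your files are not very large, then simply read them all into memory (as S. Lott suggests). That would definitely be the simplest.

However, you mention that the merge creates one "massive" file. If it's too big to fit in memory, then perhaps use heapq.merge. It may be a little harder to set up, but it has the advantage of not requiring that all the iterables be pulled into memory at once.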

import heapq
import contextlib

class Domain(object):
    def __init__(self,domain):
        self.domain=domain
    @property
    def tld(self):
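        # Put your function for calculating the TLD here. This placeholder
        # returns the first label (e.g. 'google'), not the real TLD.
        return self.domain.split('.',1)[0]
    def __lt__(self,other):
        # Strict less-than, as __lt__ is expected to be
        return self.tld<other.tld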
    def __str__(self):
        return self.domain

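def domains(fh):
    # Lazily wrap each line of an open file in a Domain
    for line in fh:
        yield Domain(line.strip())

filenames=('data1.txt','data2.txt')
# contextlib.nested and the file built-in do not exist in Python 3;
# ExitStack and a generator stand in for them here
with contextlib.ExitStack() as stack:
    fhs=[stack.enter_context(open(filename,'r')) for filename in filenames]
    for elt in heapq.merge(*(domains(fh) for fh in fhs)):
        print(elt)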

With data1.txt:

google.com
stackoverflow.com
yahoo.com

and data2.txt:

standards.freedesktop.org
www.imagemagick.org

this yields:

google.com
stackoverflow.com
standards.freedesktop.org
www.imagemagick.org
yahoo.com
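
On Python 3.5 and later, heapq.merge also accepts a key argument, so the wrapper class is not strictly needed. The following is only a minimal sketch, assuming every input is already sorted by the same key. The names tld, merge_sorted and merge_files are hypothetical, and taking the last dot-separated label is a simplification of what a real TLD function would do:

```python
import heapq
from contextlib import ExitStack


def tld(domain):
    # Hypothetical key: treat the last dot-separated label as the TLD.
    return domain.strip().rsplit('.', 1)[-1]


def merge_sorted(sources, key=tld):
    # Lazily merge iterables that are each already sorted by `key`.
    return heapq.merge(*sources, key=key)


def merge_files(filenames, out_path, key=tld):
    # All inputs stay open during the merge, but only one line per
    # file is held in memory at a time.
    with ExitStack() as stack:
        files = [stack.enter_context(open(name)) for name in filenames]
        with open(out_path, 'w') as out:
            # Normalize line endings so a missing final newline in one
            # input cannot glue two domains together.
            out.writelines(line.rstrip('\n') + '\n'
                           for line in merge_sorted(files, key=key))
```

Something like merge_files(['data1.txt', 'data2.txt'], 'merged.txt') would then write the merged result, provided both inputs were sorted with the same key.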

Answer 2

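Unless your files are incomprehensibly huge, they will fit into memory.

Your pseudo-code is hard to read. Please indent it correctly. The final "loop by reading next line" doesn't make sense.

Basically it's this.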

all_data = []
for f in list_of_files:
    with open(f, 'r') as source:
        all_data.extend(source.readlines())
# Use the same key function that sorted the individual files
all_data.sort(key=func_that_returns_tld)

You're done. You can write all_data to a file, process it further, or do whatever you want with it.
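
For completeness, here is a sketch of the whole in-memory approach including the write-out step. The names tld, sort_lines and merge_in_memory are hypothetical, and the key function simply takes the last dot-separated label, which is an assumption about how the TLD is computed:

```python
def tld(domain):
    # Hypothetical key: last dot-separated label, e.g. 'com'.
    return domain.strip().rsplit('.', 1)[-1]


def sort_lines(lines, key=tld):
    # sorted() is stable, so lines with equal keys keep their input order.
    return sorted(lines, key=key)


def merge_in_memory(list_of_files, out_path, key=tld):
    all_data = []
    for name in list_of_files:
        with open(name) as source:
            # Normalize line endings so a missing final newline
            # cannot join two lines in the output.
            all_data.extend(line.rstrip('\n') + '\n' for line in source)
    with open(out_path, 'w') as out:
        out.writelines(sort_lines(all_data, key=key))
```

Because each input file is already sorted, Python's sort can take advantage of the existing runs, so the final sort should be cheaper than sorting fully shuffled data.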

Answer 3

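Another option (again, only if all your data won't fit into memory) is to create a SQLite3 database, do the sorting there, and write the result to a file afterwards.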
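
A rough sketch of that idea with the standard-library sqlite3 module follows. The table layout, the tld helper and the function name sort_with_sqlite are all assumptions made for illustration, and the code expects a fresh database in which the domains table does not exist yet:

```python
import sqlite3


def tld(domain):
    # Hypothetical key: last dot-separated label, e.g. 'org'.
    return domain.strip().rsplit('.', 1)[-1]


def sort_with_sqlite(lines, db_path=':memory:'):
    # Pass a file path as db_path to keep the data on disk instead of in RAM.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute('CREATE TABLE domains (tld TEXT, domain TEXT)')
        conn.executemany(
            'INSERT INTO domains VALUES (?, ?)',
            ((tld(line), line.strip()) for line in lines),
        )
        conn.commit()
        # rowid breaks ties in insertion order.
        query = 'SELECT domain FROM domains ORDER BY tld, rowid'
        for (domain,) in conn.execute(query):
            yield domain
    finally:
        conn.close()
```

The generator yields domains one at a time, so the caller can write each one to the output file without building the full list in memory.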

Merge sort in Python
