Question

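This is similar to the question "merge sort in python". I'm restating it because I don't think I explained the problem very well over there.

Basically I have a series of about 1000 files, all containing domain names. Altogether the data is > 1 gig, so I'm trying to avoid loading all of it into RAM. Each individual file has been sorted using .sort(get_tld), which sorts the data by TLD (not by domain name: all the .coms end up together, all the .orgs together, etc.)

A typical file might look like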

something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org

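but obviously longer.

I now want to merge all the files into one, maintaining the sort, so that the one big file at the end still has all the .coms together, the .orgs together, etc.

What I basically want to do is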

open all the files
loop:
    read 1 line from each open file
    put them all in a list and sort with .sort(get_tld)
    write each item from the list to a new file

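The problem I'm having is that I can't figure out how to loop over the files. I can't use "with open() as" because I don't have 1 file open to loop over, I have many. They are also all of variable length, so I have to make sure to get all the way through the longest one.

Any advice is much appreciated.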

Answer 1

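Whether you can keep 1000 files open at once is a separate issue that depends on your operating system and its configuration. If you can't, you will have to proceed in two steps: merge groups of N files into temporary files, then merge the temporary files into the final result file. Two steps should suffice, since together they can merge up to N squared files; as long as N is at least 32, merging 1000 files is possible (32 squared is 1024). In any case, this is separate from the task of merging N input files into one output file; it only determines whether you call that function once or repeatedly.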
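The two-step plan described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it needs Python 3.5 or later (for the key argument of heapq.merge), it assumes every line ends with a newline, and the names merge_paths, merge_in_passes, and group_size are made up here rather than taken from the answer.

```python
import heapq
import os
import tempfile
from contextlib import ExitStack


def merge_paths(paths, out_path, key):
    """Merge text files that are each already sorted by `key` into out_path."""
    with ExitStack() as stack:
        # ExitStack closes every opened input file when the block exits.
        files = [stack.enter_context(open(p)) for p in paths]
        with open(out_path, "w") as out:
            # heapq.merge is lazy: it holds one pending line per input file.
            out.writelines(heapq.merge(*files, key=key))


def merge_in_passes(paths, out_path, key, group_size=32):
    """Merge any number of sorted files while opening at most group_size inputs at once."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    paths = list(paths)
    temps = []  # temporary files to delete at the end
    try:
        # Keep merging groups into temporary files until one merge can finish the job.
        while len(paths) > group_size:
            merged = []
            for start in range(0, len(paths), group_size):
                fd, tmp = tempfile.mkstemp(suffix=".merge")
                os.close(fd)
                temps.append(tmp)
                merge_paths(paths[start:start + group_size], tmp, key)
                merged.append(tmp)
            paths = merged
        merge_paths(paths, out_path, key)
    finally:
        for tmp in temps:
            os.remove(tmp)
```

With group_size=32 and 1000 input files, the loop body runs once (producing 32 temporary files) and the final call merges those, which matches the two steps described above.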

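The general idea for that function is to keep a priority queue (the heapq module is good at that ;-) of small lists. Each list holds the "sort key" (the current TLD, in your case), the last line read from the file, and finally the open file, ready for reading the next line. A distinct value sits right after the key, so that normal lexicographic comparison never accidentally ends up comparing two open files, which would fail. I think some code is probably the best way to explain the general idea, so I'm editing this answer to add it (I have not had time to test it, though, so treat it as pseudocode meant to convey the idea ;-).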

import heapq

def merge(inputfiles, outputfile, key):
  """inputfiles: list of input, sorted files open for reading.
     outputfile: output file open for writing.
     key: callable supplying the "key" to use for each line.
  """
  # prepare the heap: items are lists with [thekey, k, theline, thefile]
  # where k is an arbitrary int guaranteed to be different for all items,
  # theline is the last line read from thefile and not yet written out,
  # (guaranteed to be a non-empty string), thekey is key(theline), and
  # thefile is the open file
  h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
  h = [[key(s), k, s, i] for k, s, i in h if s]
  heapq.heapify(h)

  while h:
    # get and output the lowest available item (==available item w/lowest key)
    item = heapq.heappop(h)
    outputfile.write(item[2])

    # replenish the item with the _next_ line from its file (if any)
    item[2] = item[3].readline()
    if not item[2]: continue  # don't reinsert finished files

    # compute the key, and re-insert the item appropriately
    item[0] = key(item[2])
    heapq.heappush(h, item)

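Of course, in your case the key function needs to extract the top-level domain from a line containing a domain name (with its trailing newline). In your previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,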

def tld(domain):
  return domain.rsplit('.', 1)[-1].strip()

or something along these lines, is probably a reasonable approach under that constraint.

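If you are using Python 2.6 or later, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including making sure that open file objects never end up being compared by accident), with a "decorate / undecorate" approach similar to the one I use in the more portable code above.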
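One way the decorate / undecorate idea might look with heapq.merge is sketched below. The names decorated and merge_lines are hypothetical, and the sketch assumes each input yields lines already sorted by the key. The open file object is never part of the tuples being compared, and the unique index settles ties between inputs.

```python
import heapq


def decorated(lines, key, index):
    """Yield (key, index, line) tuples for one input.

    index is unique per input, so two tuples from different inputs
    are always ordered by the key or, on a tie, by the index.
    """
    for line in lines:
        yield key(line), index, line


def merge_lines(inputs, key):
    """Lazily merge iterables of lines that are each sorted by key."""
    streams = [decorated(lines, key, i) for i, lines in enumerate(inputs)]
    for _key, _index, line in heapq.merge(*streams):
        yield line  # undecorate: hand back only the original line
```

With the files already open, something like outputfile.writelines(merge_lines(inputfiles, tld)) would then write the merged result, where tld is the key function shown earlier in this answer.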

Answer 2

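You want to use a merge sort, e.g. heapq.merge. I'm not sure whether your OS allows you to open 1000 files simultaneously. If it doesn't, you may have to do it in 2 or more passes.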

Answer 3

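Why not split the domains by first letter? You would just split the source files into 26 or more files, named something like domains-a.dat, domains-b.dat. Then you can load each of them entirely into RAM, sort it, and write it out to a common file.

So: 3 input files are split into 26+ source files; the 26+ source files can each be loaded individually, sorted in RAM, and then written to the combined file.

If 26 files are not enough, I'm sure you could split into even more files (domains-ab.dat and so on). The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.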

Answer 4

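Your algorithm for merging sorted files is incorrect. What you should do is read one line from each file, find the lowest-ranked item among all the lines read, and write it to the output file, then read the next line from the file that item came from. Repeat this process (ignoring any files that are at EOF) until you have reached the end of all files.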
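A minimal sketch of that procedure, assuming the input files are already open for reading and each is sorted by the key; the name merge_by_scanning is made up for illustration. It scans all pending lines on every step, so each output line costs time proportional to the number of files, which is the work a heap would reduce.

```python
def merge_by_scanning(inputfiles, outputfile, key):
    """Merge sorted, open input files by repeatedly writing the lowest pending line."""
    # One pending line per file; files that are empty from the start are dropped.
    pending = [[f.readline(), f] for f in inputfiles]
    pending = [entry for entry in pending if entry[0]]

    while pending:
        # Linear scan for the position of the line with the lowest key.
        i = min(range(len(pending)), key=lambda j: key(pending[j][0]))
        line, source = pending[i]
        outputfile.write(line)

        # Refill from the same file, or drop that file once it reaches EOF.
        next_line = source.readline()
        if next_line:
            pending[i][0] = next_line
        else:
            del pending[i]
```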

Answer 5

#! /usr/bin/env python

"""Usage: unconfuse.py file1 file2 ... fileN

Reads a list of domain names from each file, and writes them to standard output grouped by TLD.
"""

import sys, os

spools = {}

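for name in sys.argv[1:]:
    with open(name) as infile:
        for line in infile:
            if line == "\n":
                continue
            # The TLD is everything after the last dot, minus the newline.
            tld = line[line.rindex(".") + 1:].strip()
            spool = spools.get(tld)
            if spool is None:
                # One temporary spool file per TLD, opened for writing and reading back.
                spool = open(tld + ".spool", "w+")
                spools[tld] = spool
            spool.write(line)

# Replay the spools in TLD order, deleting each one once it has been written out.
for tld in sorted(spools):
    spool = spools[tld]
    spool.seek(0)
    for line in spool:
        sys.stdout.write(line)
    spool.close()
    os.remove(spool.name)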

Confusing loop problem (python)

5 answers: