Question

I have about 30 files of 500 MB each, one word per line. I have a script that does this, in pseudo-bash:

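scratch=$(mktemp -d)    # keep scratch files out of the globs below
for i in *; do
    : > "$scratch/everythingButI"
    for j in *; do
        [ "$j" = "$i" ] && continue    # every input file except $i
        cat "$j" >> "$scratch/everythingButI"
        sort -u "$scratch/everythingButI" > "$scratch/tmp"
        mv "$scratch/tmp" "$scratch/everythingButI"
    done
    # keep only the lines that appear in $i and nowhere else
    comm -23 "$i" "$scratch/everythingButI" > "$scratch/uniqueInI"

    percentUnique=$(awk -v u="$(wc -l < "$scratch/uniqueInI")" -v t="$(wc -l < "$i")" 'BEGIN { printf "%.2f", 100 * u / t }')
    echo "$i is $percentUnique% unique"
done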

It computes the "uniqueness" of each file (each file is already sorted and deduplicated internally).

So if I have the files:

file1    file2   file3
a        b       1
c        c       c
d        e       e
f        g
         h

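file1 would be 75% unique (because 1/4 of its lines are found in another file), file2 would be 60% unique, and file3 would be 33.33% unique. But scale this up to 30 files of 500 MB each and it takes a while to run.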
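
The definition can be checked on the toy files above with a small in-memory sketch. It uses Python sets, so it only illustrates the arithmetic and would not fit 30 files of 500 MB into 2 GB of RAM; the names `files` and `percent_unique` are made up for this illustration.

```python
# Toy data copied from the table in the question.
files = {
    "file1": ["a", "c", "d", "f"],
    "file2": ["b", "c", "e", "g", "h"],
    "file3": ["1", "c", "e"],
}

def percent_unique(name, files):
    """Percentage of lines in files[name] that occur in no other file."""
    others = set()
    for other_name, lines in files.items():
        if other_name != name:
            others.update(lines)
    own = files[name]
    unique = [line for line in own if line not in others]
    return 100.0 * len(unique) / len(own)

for name in files:
    print("{0} is {1:.2f}% unique".format(name, percent_unique(name, files)))
```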

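I'd like to write a Python script that does this much faster, but I'm wondering what the fastest algorithm actually is. (I also only have 2 GB of RAM on the PC.)

Does anyone have opinions on algorithms, or know a faster way to do this?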

Answer 1

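Edit: Since each input is already internally sorted and deduplicated, what you actually need here is an n-way merge; the hash-building exercise in the previous version of this post was rather pointless.

The n-way merge is kind of intricate if you're not careful. Basically, it works like this: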

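Read in the first line of each file, and initialize each file's unique-line counter and total-line counter to 0.
Do this loop body:
- Find the least value among the lines read.
- If that value is not the same as the value from any other file, increment that file's unique-line counter.
- For each file, if the least value equals the last value read from it, read in the next line and increment that file's total-line counter. If you hit end of file, you're done with that file: remove it from further consideration.
Loop until you have no files left under consideration. At that point, you should have an accurate unique-line counter and total-line counter for each file. Percentages are then a simple matter of multiplication and division.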
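
The steps above might be sketched in Python roughly as follows. This is an illustration under assumptions, not the answerer's code: the function name `uniqueness` and its dict-of-iterables interface are made up here, the inputs are assumed to be sorted in plain code-point order (for example with `LC_ALL=C sort -u`) so that Python's string comparison agrees with the file order, and each line is counted toward the total at the moment it is read.

```python
def uniqueness(sources):
    """sources maps a name to an iterable of sorted, deduplicated lines.

    Returns {name: (unique_count, total_count)}.
    """
    iters = {name: iter(src) for name, src in sources.items()}
    unique = {name: 0 for name in sources}
    total = {name: 0 for name in sources}
    current = {}  # name -> line currently held for that source

    def advance(name):
        # Pull the next line; drop the source once it is exhausted.
        line = next(iters[name], None)
        if line is None:
            current.pop(name, None)
        else:
            current[name] = line.rstrip("\n")
            total[name] += 1

    for name in list(iters):
        advance(name)

    while current:
        smallest = min(current.values())
        holders = [name for name, line in current.items() if line == smallest]
        if len(holders) == 1:
            unique[holders[0]] += 1  # no other file holds this value
        for name in holders:
            advance(name)

    return {name: (unique[name], total[name]) for name in sources}

toy = {
    "file1": ["a", "c", "d", "f"],
    "file2": ["b", "c", "e", "g", "h"],
    "file3": ["1", "c", "e"],
}
for name, (u, t) in uniqueness(toy).items():
    print("{0} is {1:.2%} unique".format(name, u / t))
```

Because the function only iterates over its inputs, open file objects can be passed in place of the lists, so only one line per file is held in memory at a time.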

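I've left out the use of a priority queue that is in the full form of the merge algorithm; that only becomes significant when you have a large enough number of input files.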
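
For reference, the priority-queue variant mentioned above could look something like this with the standard `heapq` module. It is a sketch, not part of the original answer: `uniqueness_heap` is a hypothetical name, and the inputs are again assumed to be sorted in code-point order. The heap replaces the linear scan for the minimum, which matters only as the number of files grows.

```python
import heapq

def uniqueness_heap(sources):
    """Same result as a linear-scan merge, using a heap to find the minimum."""
    names = list(sources)
    iters = [iter(sources[name]) for name in names]
    unique = [0] * len(names)
    total = [0] * len(names)
    heap = []  # entries are (line, source index)

    def push_next(i):
        # Read the next line of source i, if any, and queue it.
        line = next(iters[i], None)
        if line is not None:
            total[i] += 1
            heapq.heappush(heap, (line.rstrip("\n"), i))

    for i in range(len(names)):
        push_next(i)

    while heap:
        smallest, first = heapq.heappop(heap)
        holders = [first]
        # Pop every other source that currently holds the same value.
        while heap and heap[0][0] == smallest:
            holders.append(heapq.heappop(heap)[1])
        if len(holders) == 1:
            unique[first] += 1
        for i in holders:
            push_next(i)

    return {names[i]: (unique[i], total[i]) for i in range(len(names))}
```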

Answer 2

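Use a modified N/K-way sort algorithm that processes the entire set of files being compared in a single pass. Only counting and advancing need to be done; the merging part itself can be skipped.

This takes advantage of the fact that the input is already sorted. If the files aren't already sorted, sort them and store them on disk sorted :-) Let the operating system's file buffers and read-ahead be your friends.

Happy coding.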

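With a little bit of cleverness, I believe this could also be extended to report the percentage difference between all of the files in a single pass. You just need to keep track of the "trailing" input and the counters for each set of relationships (m-m vs. 1-m).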
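
One possible reading of that extension, offered only as a sketch: during the same merge, count how many lines each pair of files has in common. The answer does not spell out its bookkeeping, so the function name `pairwise_shared` and the choice of a per-pair counter are assumptions made here, as is code-point sort order of the inputs.

```python
import itertools

def pairwise_shared(sources):
    """Return ({(a, b): shared line count}, {name: total line count})."""
    iters = {name: iter(src) for name, src in sources.items()}
    total = {name: 0 for name in sources}
    shared = {pair: 0 for pair in itertools.combinations(sorted(sources), 2)}
    current = {}  # name -> line currently held for that source

    def advance(name):
        line = next(iters[name], None)
        if line is None:
            current.pop(name, None)  # source exhausted
        else:
            current[name] = line.rstrip("\n")
            total[name] += 1

    for name in list(iters):
        advance(name)

    while current:
        smallest = min(current.values())
        holders = sorted(n for n, line in current.items() if line == smallest)
        # Every pair of files holding the same value shares one line.
        for pair in itertools.combinations(holders, 2):
            shared[pair] += 1
        for name in holders:
            advance(name)

    return shared, total
```

Dividing `shared[(a, b)]` by `total[a]` would then give the share of `a`'s lines that also appear in `b`.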

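Spoiler code that appears to work for me on the data provided in the question...

Of course, I haven't tested this on really large files or, really, at all. "It ran". The definition of "unique" above was simpler than I initially thought, so some of the previous answers don't apply much. This code is far from perfect. Use at your own risk (of both the computer blowing up and boredom/disgust at my not cranking out something better!). Runs on Python 3.1.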

import os
import itertools

# see: http://docs.python.org/dev/library/itertools.html#itertools-recipes
# modified for 3.x and eager lists
def partition(pred, iterable):
    t1, t2 = itertools.tee(iterable)
    return list(itertools.filterfalse(pred, t1)), list(filter(pred, t2))

# all files here
base = "C:/code/temp"
names = os.listdir(base)

for n in names:
    print("analyzing {0}".format(n))

# {name => file}
# files are removed from here as they are exhausted
files = dict([n, open(os.path.join(base,n))] for n in names)

# {name => number of shared items in any other list}
shared_counts = {}
# {name => total items this list}
total_counts = {}
for n in names:
    shared_counts[n] = 0
    total_counts[n] = 0

# [name, currentvalue] -- remains mostly sorted and is
# always a very small n so sorting should be lickety-split
vals = []
for n, f in files.items():
    # assumes no files are empty
    vals.append([n, str.strip(f.readline())])
    total_counts[n] += 1

while len(vals):
    vals = sorted(vals, key=lambda x:x[1])
    # if two low values are the same then the value is not-unique
    # adjust the logic based on definition of unique, etc.
    low_value = vals[0][1]
    lows, highs = partition(lambda x: x[1] > low_value, vals)
    if len(lows) > 1:
        for lname, _ in lows:
            shared_counts[lname] += 1
    # all lowest items discarded and refetched
    vals = highs
    for name, _ in lows:
        f = files[name]
        val = f.readline()
        if val != "":
            vals.append([name, str.strip(val)])
            total_counts[name] += 1
        else:
            # close files as we go. eventually we'll
            # dry-up the 'vals' and quit this mess :p
            f.close()
            del files[name]

# and what we want...
for n in names:
    unique = 1 - (shared_counts[n]/total_counts[n])
    print("{0} is {1:.2%} unique!".format(n, unique))

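And in retrospect, I can already see the flaws! :-) The sorting of vals is there for a legacy reason that no longer really applies. In practice, just a min would work fine here (and would likely be better for any relatively small set of files).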

Answer 3

Here's some really ugly pseudocode that does the n-way merge

#!/usr/bin/env python3

import os
import sys

def findmin(linesread):
    """Return the smallest non-empty current line and the indexes of every file holding it."""
    smallest = ""
    indexes = []
    for i, line in enumerate(linesread):
        if line == "":
            continue  # this file is exhausted
        if not indexes or line < smallest:
            smallest = line
            indexes = [i]
        elif line == smallest:
            indexes.append(i)
    return smallest, indexes

def genUniqueness(path):
    wordlists = []

    for root, dirs, files in os.walk(path):
        if ".git" in root or root == ".":
            continue
        if "onlyuppercase" in root:
            continue

        for name in files:
            if "lvl" in name or "trimmed" in name:
                wordlists.append(os.path.join(root, name))
                print(wordlists[-1])

    whandles = [open(w, "r") for w in wordlists]
    linesread = [""] * len(wordlists)    # current line of each file, "" once exhausted
    numlines = [0] * len(wordlists)      # total lines read per file
    uniquelines = [0] * len(wordlists)   # lines found in no other file

    def advance(i):
        # Read the next line of file i; an empty read means end of file.
        # Assumes the word lists contain no blank lines.
        linesread[i] = whandles[i].readline().strip()
        if linesread[i] == "":
            whandles[i].close()
            print("Expiring", wordlists[i])
        else:
            numlines[i] += 1

    for i in range(len(whandles)):
        advance(i)

    while any(linesread):
        smallest, indexes = findmin(linesread)
        if len(indexes) == 1:
            uniquelines[indexes[0]] += 1
        for i in indexes:
            advance(i)

    with open(path + ".fastuniqueness", "w") as log:
        for i in range(len(wordlists)):
            log.write(wordlists[i] + "," + str(uniquelines[i]) + "," + str(numlines[i]) + "\n")
            print(wordlists[i], uniquelines[i], numlines[i])

# Assumed entry point: the original never calls genUniqueness.
if __name__ == "__main__":
    genUniqueness(sys.argv[1])

Most efficient way to compute the uniqueness (%) of a file compared to several other large files
