Question

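I am writing a detailed file validation script for my project in Python 2.7, using only the core Python API. It compares the source and target files of another ETL job. It covers metadata validation, count validation, duplicate checks, null checks, and full line-by-line data validation. The script is finished and runs fine on 100k data sets (I tested with 100k and 200k volumes). However, when I run it on millions of records, the duplicate check runs forever (I mean it takes a very long time). I debugged the code and found that the duplicate check method below is causing the problem.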

    def dupFind(dup_list=[],output_path=""):
        #dup_list is the list containing duplicates. Actually this is the list of contents of a file line by line as entries
        #output_path is the path to which output records and respective duplicate count of each records are saved as a single file
        #duplicates is a set which contains tuples with two elements each in which first element is the duplicated record and second is the duplicated count

        duplicates=set((x,dup_list.count(x)) for x in filter(lambda rec : dup_list.count(rec)>1,dup_list)) 
        print "time taken for preparing duplicate list is {}".format(str(t1-t0))
        dup_report="{}\dup.{}".format(output_path, int(time.time()))
        print "Please find the duplicate records  in {}".format(dup_report)
        print ""
        with open(dup_report, 'w+') as f:
            f.write("RECORD|DUPLICATE_COUNT\n")
            for line in duplicates:
                f.write("{}|{}\n".format(line[0], line[1]))

First, I read the files and convert them to lists as shown below (this runs fast):

    with open(sys.argv[1]) as src, open(sys.argv[2]) as tgt:
        src = map(lambda x: x.strip(), list(src))
        tgt = map(lambda x: x.strip(), list(tgt))

After that, I apply the following logic (simplified) to both the `src` and `tgt` lists to find out whether a file contains duplicates:

    # output_path is passed as a user argument when running the script

    if len(set(tgt)) < len(tgt):  # the target file contains duplicates
        dupFind(tgt, output_path)
    if len(set(src)) < len(src):  # the source file contains duplicates
        dupFind(src, output_path)

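Whichever list contains duplicates is passed to the dupFind function, which saves the duplicated records and their counts to a file in the output path named "dup.epochtime". If I run the whole file validation script on millions of records (even 1M), it runs forever. When I debugged the function, the particular line below turned out to be causing the performance problem.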

    # filter() keeps only the records that occur more than once in the list;
    # each one is then paired with its count as a (record, count) tuple

    duplicates=set((x,dup_list.count(x)) for x in filter(lambda rec : dup_list.count(rec)>1,dup_list))

The duplicates output file looks like this:

    RECORD|DUPLICATE_COUNT
    68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT|2
    68835,2014-05-02 00:00:00.0,764,COMPLETE|2
    68878,2014-07-08 00:00:00.0,6753,COMPLETE|2
    68834,2014-05-01 00:00:00.0,6938,COMPLETE|2

Can anyone help me modify the logic, or write new logic, so that it can handle millions of records in one run? In my project, files go up to 40M or 50M.

Answer 1

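You are calling list.count in a loop. That is very inefficient: every call scans the whole list, so calling it once per record takes quadratic time. Instead, make one pass to get the counts, then another pass to filter on those counts. That is linear time versus quadratic time. So use the fast collections.Counter object: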

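    from collections import Counter

    def dupFind(dup_list=(), output_path=""):
        # One pass to count every record, then one pass over the counts to filter
        counts = Counter(dup_list)
        # iteritems() is Python 2; on Python 3 use counts.items()
        duplicates = {(x, c) for x, c in counts.iteritems() if c > 1}
        ...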

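Note that I switched the default dup_list argument to an empty tuple instead of an empty list. Mutable default arguments can cause bugs if you are not aware of how they behave.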

The solution above does need auxiliary space, but it should be very fast; collections.Counter is essentially a dict optimized for counting.
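
For completeness, here is a minimal sketch of how the whole function might look with the report writing from the question folded in. It is not the asker's final code: the name `dup_find_counter` is made up for illustration, `os.path.join` is swapped in for the hand-built `"{}\dup.{}"` path, and `.items()` is used so the sketch runs on both Python 2.7 and Python 3.

```python
import os
import time
from collections import Counter


def dup_find_counter(records, output_path):
    """Write each record that occurs more than once, with its count.

    `records` can be any iterable of hashable items, for example a list of
    stripped lines. Returns the path of the report file (hypothetical helper).
    """
    counts = Counter(records)  # single pass: record -> number of occurrences
    report = os.path.join(output_path, "dup.{}".format(int(time.time())))
    with open(report, "w") as f:
        f.write("RECORD|DUPLICATE_COUNT\n")
        for record, count in counts.items():
            if count > 1:  # keep only the duplicated records
                f.write("{}|{}\n".format(record, count))
    return report
```

Because the duplicates are written straight from the counter, the intermediate set of tuples is no longer needed; memory use is driven by the number of distinct records.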

Answer 2

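I don't see any difference between src and tgt, so here is a solution for a generic list. I think this solution will speed up the scan. For even more speed, I would try PyPy or C.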

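    import sys

    def dup_find(sequence, marker=object()):
        # Expects a sorted sequence, so equal items sit next to each other.
        # marker is a sentinel that never compares equal to a real item.
        prev = marker
        c = 1
        for item in sequence:
            if item == prev:
                c += 1
            else:
                # A run of equal items just ended; report it if it was a duplicate
                if c > 1:
                    yield prev, c
                prev = item
                c = 1
        # Report the last run as well
        if c > 1:
            yield prev, c

    def print_dup(sequence, output):
        for item, count in dup_find(sequence):
            output.write('%s|%s\n' % (item, count))

    with open(sys.argv[1]) as fp:
        lines = sorted(map(str.strip, fp))
    if len(set(lines)) < len(lines):
        print_dup(lines, sys.stdout)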
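
The same sort-then-scan idea can probably also be written with `itertools.groupby` from the standard library, which groups adjacent equal items. This is an alternative sketch rather than the code above; `dup_find_sorted` is a hypothetical name.

```python
from itertools import groupby


def dup_find_sorted(lines):
    """Yield (record, count) for every record that appears more than once."""
    # Sorting puts equal records next to each other, so groupby sees each
    # distinct record as exactly one run.
    for record, run in groupby(sorted(lines)):
        count = sum(1 for _ in run)  # length of the run of equal records
        if count > 1:
            yield record, count
```

Like the generator above, this costs one sort plus a single linear scan, and it yields the duplicates in sorted order.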

Python: performance issue when finding duplicates in a file with millions of records
