Counting duplicate pairs in a file with constraints

Date: 2014-07-19 04:48:30

Tags: python algorithm optimization duplicates text-processing

Problem statement

I have the following file (sorted on all three columns):

D000001 D000001 1975
D000001 D000001 1976
D000001 D002413 1976
D000001 D002413 1979
D000001 D002413 1987
D000001 D004298 1976
D000002 D000002 1985
D000003 D000900 1975
D000003 D000900 1990
D000003 D004134 1983
D000003 D004134 1986

I need to count the duplicate pairs (in columns 1 and 2) and assign each such pair the lowest value from column 3. For my toy file, the output should be:

D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983

My questions

  1. The files are large (1 GB to 5 GB), and I would like to know what the most appropriate programming structure is in this setting.
  2. How do I print the last field (column 3) correctly? In the current setup (code below), the program prints the last (highest) value; a minimal fix is sketched after the current output.
  3. My initial attempt that produces the current output is as follows.

    my_dict = {}

    with open('test.srt') as infile:
      for line in infile:
        line = line.rstrip()
        word1, word2, year = line.split()  # the sample file is whitespace-delimited
        year = int(year)
        my_tuple = (word1, word2)
        if my_tuple in my_dict:
          freq += 1
          my_dict[my_tuple] = (freq, year)  # overwrites the stored year every time
        else:
          freq = 1
          my_dict[my_tuple] = (freq, year)

    for key, value in my_dict.items():
      print key[0], key[1], value           # Python 2 print statement
    

    Current output:

    D000001 D000001 (2, 1976) ## Should be 1975 etc.
    D000001 D002413 (3, 1987)
    D000001 D004298 (1, 1976)
    D000002 D000002 (1, 1985)
    D000003 D000900 (2, 1990)
    D000003 D004134 (2, 1986)
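
For reference, the year issue above comes from overwriting the stored year on every duplicate (and from the freq variable carried over between iterations, which only happens to work because the file is sorted). A minimal sketch of a fix inside the same dictionary approach, keeping everything else unchanged:

    if my_tuple in my_dict:
      old_freq, old_year = my_dict[my_tuple]
      # keep the lowest year seen so far instead of overwriting it
      my_dict[my_tuple] = (old_freq + 1, min(old_year, year))
    else:
      my_dict[my_tuple] = (1, year)

As the answers below point out, though, an in-memory dictionary may not be the right structure for 1 GB to 5 GB files in the first place.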
    

4 Answers:

Answer 0 (score: 3)

Since the file is large, you should not use an in-memory dictionary to manage the data. Start reading the source file and write the results directly to the target file. All you really need is three variables:

one to store the current tuple, a second to store the count, and a third to store the lowest value. When the tuple changes, write the values to the output file and continue; a sketch follows below.

This has a tiny memory footprint and can handle insanely large files. Of course, this only works because your tuples are sorted.
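
A minimal sketch of this streaming approach (the file names are placeholders; the input is assumed whitespace-delimited and fully sorted, as in the question):

current = None   # pair from columns 1 and 2 currently being counted
count = 0        # occurrences of the current pair
min_year = None  # lowest column-3 value seen for the current pair

with open('test.srt') as infile, open('result.txt', 'w') as outfile:
    for line in infile:
        word1, word2, year = line.split()
        pair, year = (word1, word2), int(year)
        if pair == current:
            count += 1
            min_year = min(min_year, year)
        else:
            if current is not None:  # flush the finished group
                outfile.write('%s %s %d %d\n' % (current + (count, min_year)))
            current, count, min_year = pair, 1, year
    if current is not None:  # flush the final group
        outfile.write('%s %s %d %d\n' % (current + (count, min_year)))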

Answer 1 (score: 3)

Groupby and generators are the way to go:

import csv
from itertools import groupby

def count_duplicate(it):
    # group by the first two fields
    groups = groupby(it, lambda line: line[:2])
    # this will produce (key, group) pairs, where a group is an iterator
    # over ['field0', 'field1', year] records whose field0 and field1
    # strings are the same
    # the min_and_count function converts such a group into a (count, min) pair
    def min_and_count(group):
        i, min_year = 0, 99999
        for _, _, year in group:
            i += 1
            min_year = year if year < min_year else min_year
        return (i, min_year)

    yield from map(lambda x: x[0] + [min_and_count(x[1])], groups)


with open("test.srt") as fp:
    # this reads the lines in a lazy fashion and filters empty lines out
    lines = filter(bool, csv.reader(fp, delimiter=' '))
    # convert the last value to integer (still in a lazy fashion)
    lines = map(lambda line: [line[0], line[1], int(line[2])], lines)
    # write result to another file
    with open("result_file", "w") as rf:
        for record in count_duplicate(lines):
            rf.write(str(record) + '\n')
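
Note that rf.write(str(record) + '\n') produces lines like ['D000001', 'D000001', (2, 1975)] rather than the space-separated format shown in the question, so a final formatting pass would still be needed to match it exactly.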

NB: this is a Python 3.x solution, where filter and map return iterators rather than lists, as they do in Python 2.x.
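
On Python 2.x the lazy equivalents live in itertools; a sketch of just the two affected lines (the yield from in count_duplicate would also need to be rewritten as an explicit loop):

from itertools import imap, ifilter  # lazy map/filter on Python 2

lines = ifilter(bool, csv.reader(fp, delimiter=' '))
lines = imap(lambda line: [line[0], line[1], int(line[2])], lines)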

Answer 2 (score: 2)

Solution:

#!/usr/bin/env python


def readdata(filename):
    # "last" holds the first line of the current group; because the input
    # is sorted on all three columns, its year is the group's minimum
    last = []
    count = 0

    with open(filename, "r") as fd:
        for line in fd:
            tokens = line.strip().split()
            tokens[2] = int(tokens[2])

            if not last:
                last = tokens

            if tokens[:2] != last[:2]:
                # the pair changed: emit the finished group, start a new one
                yield last[:2], count or 1, last[2]
                last = tokens
                count = 1
            else:
                count += 1

            tokens[2] = min(tokens[2], last[2])

        # emit the final group
        yield last[:2], count, last[2]


with open("output.txt", "w") as fd:
    for words, count, year in readdata("data.txt"):
        fd.write(
            "{0:s} {1:s} ({2:d} {3:d})\n".format(
                words[0], words[1], count, year
            )
        )

Output:

D000001 D000001 (2 1975)
D000001 D002413 (3 1976)
D000001 D004298 (1 1976)
D000002 D000002 (1 1985)
D000003 D000900 (2 1975)
D000003 D004134 (2 1983)

Discussion:

  • This reads and processes the data iteratively (Python 2.x), so it never loads everything into memory, which allows very large data files to be processed.
  • As long as the input data is sorted, no complex data structures are needed either. We only need to keep track of the last set of tokens and the minimum year for each set of "duplicates".

The algorithm is essentially very similar to itertools.groupby (see the other answer that uses it, though it assumes Python 3.x).

It is also worth noting that this implementation is O(n) (Big O).

Answer 3 (score: 2)

TXR:

@(repeat)
@dleft @dright @num
@  (collect :gap 0 :vars (dupenum))
@dleft @dright @dupenum
@  (end)
@  (output)
@dleft @dright @(+ 1 (length dupenum)) @num
@  (end)
@(end)

Run:

$ txr data.txr data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983

AWK:

# when the pair in columns 1 and 2 changes, flush the buffered line
# for the previous pair and remember the new pair's first (lowest) year
$1 != old1 || $2 != old2 { printf("%s", out);
                           count = 0
                           old1 = $1
                           old2 = $2
                           old3 = $3 }

# rebuild the buffered output line on every row, bumping the count
                         { out = $1 " " $2 " " ++count " " old3 "\n" }

# flush the final group
END                      { printf("%s", out); }

$ awk -f data.awk data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983

TXR Lisp functional one-liner:

$ txr -t '[(opip (mapcar* (op split-str @1 " "))
                 (partition-by [callf list first second])
                 (mapcar* (aret `@[@1 0..2] @(+ 1 (length @rest)) @[@1 2]`)))
           (get-lines)]' < data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983