
Time: 2014-07-19 04:48:30

Tags: python algorithm optimization duplicates text-processing


Sample input:
D000001 D000001 1975
D000001 D000001 1976
D000001 D002413 1976
D000001 D002413 1979
D000001 D002413 1987
D000001 D004298 1976
D000002 D000002 1985
D000003 D000900 1975
D000003 D000900 1990
D000003 D004134 1983
D000003 D004134 1986

Desired output:
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983


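  1. The files are large (1 GB to 5 GB), so I would like to know which programming structure is most suitable in this setting.
  2. How do I print the last column (the year, column 3) correctly? In the current setup (see the code below), the program prints the last (highest) value.
  3. My initial attempt, followed by its current output, is shown below.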

    my_dict = {}
    with open('test.srt') as infile:
      for line in infile:
        line = line.rstrip()
        word1, word2, year = line.split('|')
        year = int(year)
        my_tuple = (word1, word2)
        if my_tuple in my_dict:
          freq += 1
          my_dict[my_tuple] = (freq, year)
        else:
          freq = 1
          my_dict[my_tuple] = (freq, year)
    for key, value in my_dict.items():
      print key[0], key[1], value


    D000001 D000001 (2, 1976) ## Should be 1975 etc.
    D000001 D002413 (3, 1987)
    D000001 D004298 (1, 1976)
    D000002 D000002 (1, 1985)
    D000003 D000900 (2, 1990)
    D000003 D004134 (2, 1986)

4 answers:

import csv
from itertools import groupby

def count_duplicate(it):
    # group by the first two fields
    groups = groupby(it, lambda line: line[:2])
    # this produces (key, group) pairs, where each group is an iterator
    # over ['field0', 'field1', year] values whose field0 and field1
    # strings are the same
    # the min_and_count function converts such a group into a (count, min) pair
    def min_and_count(group):
        i, min_year = 0, 99999
        for _, _, year in group:
            i += 1
            min_year = year if year < min_year else min_year
        return (i, min_year)

    yield from map(lambda x: x[0] + [min_and_count(x[1])], groups)

with open("test.srt") as fp:
    # this reads the lines in a lazy fashion and filter empty lines out
    lines = filter(bool, csv.reader(fp, delimiter=' '))
    # convert the last value to integer (still in a lazy fashion)
    lines = map(lambda line: [line[0], line[1], int(line[2])], lines)
    # write result to another file
    with open("result_file", "w") as rf:
        for record in count_duplicate(lines):
            rf.write(str(record) + '\n')

NB: This is a Python 3.x solution, where filter and map return iterators rather than lists as they do in Python 2.x.
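To make the NB concrete, here is a small sketch of the lazy behaviour in Python 3. The `rows` list and the variable names are made up for illustration; only `filter`, `map`, and `next` are standard built-ins.

```python
# Python 3: filter() and map() return lazy iterators, not lists.
rows = ["D000001 D000001 1975", "", "D000001 D000001 1976"]

non_empty = filter(bool, rows)        # nothing has been evaluated yet
parsed = map(str.split, non_empty)    # still lazy

first = next(parsed)                  # only now is the first row processed
remaining = list(parsed)              # consumes the rest of the iterator
```

Because each row is only produced when it is asked for, the same pipeline can run over a multi-gigabyte file without loading it into memory.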

Answer 2 (score: 2)


#!/usr/bin/env python

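def readdata(filename):
    """Yield ([word1, word2], count, min_year) for each run of equal pairs.

    Assumes the input is sorted by the first two fields.
    """
    last = []
    count = 0

    with open(filename, "r") as fd:
        for line in fd:
            tokens = line.strip().split()
            if not tokens:
                continue  # skip blank lines
            tokens[2] = int(tokens[2])

            if not last:
                # the first record starts the first group
                last = tokens
                count = 1
                continue

            if tokens[:2] != last[:2]:
                # key changed: emit the finished group and start a new one
                yield last[:2], count, last[2]
                last = tokens
                count = 1
            else:
                count += 1
                # keep the smallest year seen so far in this group
                last[2] = min(tokens[2], last[2])

    if last:
        yield last[:2], count, last[2]

with open("output.txt", "w") as fd:
    for words, count, year in readdata("data.txt"):
        fd.write(
            "{0:s} {1:s} ({2:d} {3:d})\n".format(
                words[0], words[1], count, year
            )
        )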


D000001 D000001 (2 1975)
D000001 D002413 (3 1976)
D000001 D004298 (1 1976)
D000002 D000002 (1 1985)
D000003 D000900 (2 1975)
D000003 D004134 (2 1983)


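  • This reads and processes the data iteratively (Python 2.x), so it does not read everything into memory, which makes it possible to process very large data files.
  • As long as the input data is sorted, no complex data structure is needed either. We only have to keep track of the last set of tokens and of the minimum year for each group of "duplicates".

This algorithm is in fact very similar to itertools.groupby (see the other answer that uses it, which however assumes Python 3.x).

It is also worth noting that this implementation is O(n) (Big O).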
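To illustrate the similarity with itertools.groupby, here is a minimal sketch of the same single pass written with it. The function name `summarize` and the in-memory `sample` list are hypothetical and only serve the example. Like the code above, it assumes the input is already sorted by the first two fields.

```python
from itertools import groupby

def summarize(lines):
    """Yield (word1, word2, count, earliest_year) per run of equal pairs.

    Assumes `lines` is sorted by the first two fields.
    """
    # lazily split each non-blank line into [word1, word2, year]
    rows = (line.split() for line in lines if line.strip())
    for (word1, word2), group in groupby(rows, key=lambda row: (row[0], row[1])):
        # only the years of the current group are held in memory
        years = [int(row[2]) for row in group]
        yield word1, word2, len(years), min(years)

sample = [
    "D000001 D000001 1975",
    "D000001 D000001 1976",
    "D000001 D002413 1976",
    "D000001 D002413 1979",
    "D000001 D002413 1987",
    "D000001 D004298 1976",
    "D000002 D000002 1985",
    "D000003 D000900 1975",
    "D000003 D000900 1990",
    "D000003 D004134 1983",
    "D000003 D004134 1986",
]
result = list(summarize(sample))
```

Since an open file object is itself an iterable of lines, it can be passed to `summarize` directly in place of `sample`.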

Answer 3 (score: 2)


@dleft @dright @num
@  (collect :gap 0 :vars (dupenum))
@dleft @dright @dupenum
@  (end)
@  (output)
@dleft @dright @(+ 1 (length dupenum)) @num
@  (end)


$ txr data.txr data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983


$1 != old1 || $2 != old2 { printf("%s", out);
                           count = 0
                           old1 = $1
                           old2 = $2
                           old3 = $3 }

                         { out = $1 " " $2 " " ++count " " old3 "\n" }

END                      { printf("%s", out); }

$ awk -f data.awk data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983

TXR Lisp functional one-liner:

$ txr -t '[(opip (mapcar* (op split-str @1 " "))
                 (partition-by [callf list first second])
                 (mapcar* (aret `@[@1 0..2] @(+ 1 (length @rest)) @[@1 2]`)))
           (get-lines)]' < data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983