Problem statement
I have the following file (it is sorted on all three columns):
D000001 D000001 1975
D000001 D000001 1976
D000001 D002413 1976
D000001 D002413 1979
D000001 D002413 1987
D000001 D004298 1976
D000002 D000002 1985
D000003 D000900 1975
D000003 D000900 1990
D000003 D004134 1983
D000003 D004134 1986
I need to count the duplicate pairs (in columns 1 and 2) and assign each such pair the lowest value from column 3. For example, the pair D000001 D002413 occurs three times, with 1976 as its lowest year. For my toy file, the output should be:
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983
My question
My initial attempt, shown below, produces the current output. The counts are right, but each pair ends up with the last year seen instead of the lowest one.
my_dict = {}
with open('test.srt') as infile:
    for line in infile:
        line = line.rstrip()
        word1, word2, year = line.split('|')
        year = int(year)
        my_tuple = (word1, word2)
        if my_tuple in my_dict:
            freq += 1
            my_dict[my_tuple] = (freq, year)
        else:
            freq = 1
            my_dict[my_tuple] = (freq, year)

for key, value in my_dict.items():
    print key[0], key[1], value
Current output:
D000001 D000001 (2, 1976) ## Should be 1975 etc.
D000001 D002413 (3, 1987)
D000001 D004298 (1, 1976)
D000002 D000002 (1, 1985)
D000003 D000900 (2, 1990)
D000003 D004134 (2, 1986)
Answer 0 (score: 3)
Since the file is huge, you should not manage the data with an in-memory dictionary. Start reading the source file and write the results straight to the destination file; all you really need is 3 variables:
one to store the current tuple, a second to store the count, and a third to store the lowest value (since the file is sorted, that is simply the first year you see for the tuple). When the tuple changes, write the values to the output file and continue.
This has a tiny memory footprint and can handle insanely large files. But of course it only works because your tuples are sorted.
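A minimal sketch of what that could look like (assuming whitespace-separated input in test.srt and an illustrative output name out.txt; not the answerer's actual code):

# Sketch of the streaming idea above. Assumptions: whitespace-separated
# fields, input sorted on the first two columns, illustrative file names.
with open('test.srt') as infile, open('out.txt', 'w') as outfile:
    current = None    # the (word1, word2) pair currently being counted
    count = 0         # occurrences of the current pair
    min_year = None   # first year seen for the pair; lowest, since sorted
    for line in infile:
        word1, word2, year = line.split()
        pair = (word1, word2)
        if pair != current:
            if current is not None:   # flush the finished group
                outfile.write('%s %s %d %s\n' % (current + (count, min_year)))
            current, count, min_year = pair, 0, year
        count += 1
    if current is not None:           # flush the last group
        outfile.write('%s %s %d %s\n' % (current + (count, min_year)))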
Answer 1 (score: 3)
groupby and generators are the way to go:
import csv
from itertools import groupby
def count_duplicate(it):
    # group by first two fields
    groups = groupby(it, lambda line: line[:2])
    # this will produce (key, group) pairs, where a group is an iterator
    # yielding ['field0', 'field1', year] values whose field0 and field1
    # strings are the same respectively
    # the min_and_count function converts such a group into a (count, min) pair
    def min_and_count(group):
        i, min_year = 0, 99999
        for _, _, year in group:
            i += 1
            min_year = year if year < min_year else min_year
        return (i, min_year)

    yield from map(lambda x: x[0] + [min_and_count(x[1])], groups)

with open("test.srt") as fp:
    # this reads the lines in a lazy fashion and filters empty lines out
    lines = filter(bool, csv.reader(fp, delimiter=' '))
    # convert the last value to an integer (still in a lazy fashion)
    lines = map(lambda line: [line[0], line[1], int(line[2])], lines)
    # write the result to another file
    with open("result_file", "w") as rf:
        for record in count_duplicate(lines):
            rf.write(str(record) + '\n')
NB: This is a Python 3.x solution, in which filter and map return iterators rather than lists, as they do in Python 2.x.
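For the toy file, each record is the two key fields with a (count, min_year) tuple appended, so, if I am reading the code right, result_file would contain lines like:

['D000001', 'D000001', (2, 1975)]
['D000001', 'D002413', (3, 1976)]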
Answer 2 (score: 2)
Solution:
#!/usr/bin/env python

def readdata(filename):
    last = []
    count = 0
    with open(filename, "r") as fd:
        for line in fd:
            tokens = line.strip().split()
            tokens[2] = int(tokens[2])
            if not last:
                last = tokens
            if tokens[:2] != last[:2]:
                # the pair changed: emit the finished group
                yield last[:2], count or 1, last[2]
                last = tokens
                count = 1
            else:
                count += 1
                tokens[2] = min(tokens[2], last[2])
    # emit the final group
    yield last[:2], count, last[2]

with open("output.txt", "w") as fd:
    for words, count, year in readdata("data.txt"):
        fd.write(
            "{0:s} {1:s} ({2:d} {3:d})\n".format(
                words[0], words[1], count, year
            )
        )
Output:
D000001 D000001 (2 1975)
D000001 D002413 (3 1976)
D000001 D004298 (1 1976)
D000002 D000002 (1 1985)
D000003 D000900 (2 1975)
D000003 D004134 (2 1983)
Discussion:
The algorithm is essentially the same as the itertools.groupby one (see the other answer that uses it, which however assumes Python 3.x).
It is worth noting that this implementation is also O(n) (Big O): it makes a single pass over the input and keeps only the current group's state in memory.
Answer 3 (score: 2)
TXR:
@(repeat)
@dleft @dright @num
@ (collect :gap 0 :vars (dupenum))
@dleft @dright @dupenum
@ (end)
@ (output)
@dleft @dright @(+ 1 (length dupenum)) @num
@ (end)
@(end)
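As far as I can tell from the pattern language, @(repeat) matches the first line of each run, the nested @(collect :gap 0 :vars (dupenum)) gathers the immediately following lines that repeat the same @dleft/@dright pair, and @(output) then prints the pair, the count (one plus the number of collected duplicates), and the first, i.e. lowest, @num.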
Run:
$ txr data.txr data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983
AWK:
$1 != old1 || $2 != old2 { printf("%s", out);
                           count = 0
                           old1 = $1
                           old2 = $2
                           old3 = $3 }
                         { out = $1 " " $2 " " ++count " " old3 "\n" }
END                      { printf("%s", out); }
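In the AWK version, out buffers one finished line at a time: when the pair in the first two fields changes, the buffered line for the previous group is printed and old3 latches the new group's first (lowest) year; every input line then rewrites the buffer with the incremented count, and the END rule flushes the last group.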
$ awk -f data.awk data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983
TXR Lisp functional one-liner:
$ txr -t '[(opip (mapcar* (op split-str @1 " "))
(partition-by [callf list first second])
(mapcar* (aret `@[@1 0..2] @(+ 1 (length @rest)) @[@1 2]`)))
(get-lines)]' < data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983