Question

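I have a very large file (on the order of gigabytes) with 4 columns. I need to count how often each combination of values in the first 2 columns occurs.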

Col[1] Col[2] Col[3] Col[4]

So I need to consider the pairs formed by Col[1] and Col[2], and count how many times a given pair occurs in the whole file.

For example:

Col[1] Col[2]
1234   5678
8901   3456
1234   5678
0987   2345
1234   5678

Here the pair 1234 5678 occurs 3 times.

I adapted some code from other posts and tried to apply it to my data file, but I ran into errors.

from itertools import combinations
from collections import Counter
import ast

 def collect_pairs('FileName.txt'):

    pair_counter = Counter()
    for line in open('FileName.txt'):
      unique_tokens = sorted(set(ast.literal_eval(lines)))
      combos = combination(unique_token, 2)
      pair_counter += Counter(combos)
    return pair_counter
    outfile = open('Outputfile.txt', 'w')
    p = collect_pairs(outfile)
    print p.most_common(10)

Answer 1

I suggest using a defaultdict and reading the file line by line.

from collections import defaultdict
d = defaultdict(int)

# count the occurrences of each (column 1, column 2) pair
with open('file', 'r') as f:
    f.readline() # discard the header line
    for numlines, line in enumerate(f,1):
        line = line.strip().split()
        c = line[0], line[1]
        d[c] += 1

# compute 100*(occurrences/numlines) for each key in d
d = {k:(v, 100*float(v)/numlines) for k,v in d.iteritems()}
for k in d:
    print k, d[k]

For your example file, this prints:

('0987', '2345') (1, 20.0)
('8901', '3456') (1, 20.0)
('1234', '5678') (3, 60.0)

The format is (column1, column2) (occurrences, percentage).
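
For readers on Python 3, the same tally can be sketched with collections.Counter, which also provides the most_common method the question was trying to use. This is only an illustration under stated assumptions: the name count_pairs, its has_header parameter, and the sample rows (including the made-up third and fourth columns) are hypothetical and not part of the original answer. Unlike the code above, it skips lines with fewer than two fields instead of raising an error.

```python
from collections import Counter


def count_pairs(lines, has_header=True):
    """Return (Counter of (col1, col2) pairs, number of data lines counted).

    `lines` can be any iterable of strings, e.g. an open file object,
    so a multi-gigabyte file is streamed rather than loaded into memory.
    """
    it = iter(lines)
    if has_header:
        next(it, None)  # discard the header line, if there is one
    counts = Counter()
    total = 0
    for line in it:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: ignore blank or malformed lines
        counts[fields[0], fields[1]] += 1
        total += 1
    return counts, total


# Hypothetical sample data mirroring the example in the question.
sample = [
    "Col[1] Col[2] Col[3] Col[4]",
    "1234 5678 a b",
    "8901 3456 a b",
    "1234 5678 a b",
    "0987 2345 a b",
    "1234 5678 a b",
]

counts, total = count_pairs(sample)
for pair, n in counts.most_common():
    print(pair, (n, 100.0 * n / total))
```

To run it on a real file, pass the open file object in place of the list, for example count_pairs(open('file')) inside a with block.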

If you only need the number of occurrences of a single pair, for example '1234' and '5678', you can do this:

find = '1234', '5678'
counter = 0
with open('file', 'r') as f:
    f.readline() # discard the header line
    for numlines, line in enumerate(f,1):
        line = line.strip().split()
        c = line[0], line[1]
        if c == find:
            counter += 1

print counter, 100*float(counter)/numlines

Output for the example file:

3 60.0

I have assumed that the header line does not count toward the total when computing the percentages. If it should count, change enumerate(f,1) to enumerate(f,2).
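
The effect of that start value on the denominator can be seen in a small Python 3 sketch. The helper pair_percentage and its count_header flag are hypothetical names introduced only for this illustration; the rows are the data lines from the question's example, with the header already removed.

```python
def pair_percentage(data_lines, target, count_header=False):
    """Return (hits, percentage) for `target` among the data lines.

    With count_header=True the header is included in the line total,
    which corresponds to enumerate(f, 2) in the answer's code.
    """
    start = 2 if count_header else 1
    hits = 0
    numlines = start - 1  # value used if there are no data lines at all
    for numlines, line in enumerate(data_lines, start):
        fields = line.split()
        if len(fields) >= 2 and (fields[0], fields[1]) == target:
            hits += 1
    if numlines == 0:
        return hits, 0.0  # avoid dividing by zero on empty input
    return hits, 100.0 * hits / numlines


rows = ["1234 5678", "8901 3456", "1234 5678", "0987 2345", "1234 5678"]
find = ("1234", "5678")

print(pair_percentage(rows, find))                     # 3 hits out of 5 lines
print(pair_percentage(rows, find, count_header=True))  # 3 hits out of 6 lines
```

With the header excluded, the pair accounts for 3 of 5 lines (60.0 percent); with the header counted, it accounts for 3 of 6 lines (50.0 percent).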

Finding the number of occurrences and the percentage in Python
