Python - Computing field density in a CSV/TSV file

Time: 2012-09-04 21:47:35

Tags: python csv mapreduce field distributed-computing

Hi, suppose I have a tab-delimited file like this (each field is separated by a tab):

Name    ID    Country    GPA
Tom    id1    USA        3.4
Jon    id2    Canada    
Amy           UK         3.0
Kevin  id4    Scotland    
Kris                     3.1

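Here the density of Name is 1.0, i.e. 100%. The density of ID is 0.6, i.e. 60% (2 fields are missing). The density of Country is 0.8, and the density of GPA is also 0.6.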
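To make the definition concrete, here is a minimal sketch that computes the densities for the sample above. The name `field_density` and the `SAMPLE` string are made up for illustration, and the sketch assumes the fields really are separated by single tabs, with an empty string where a value is missing.

```python
import csv
import io

# The sample from the question, written with explicit tab separators.
# A missing value is an empty string between two tabs (assumption).
SAMPLE = (
    "Name\tID\tCountry\tGPA\n"
    "Tom\tid1\tUSA\t3.4\n"
    "Jon\tid2\tCanada\t\n"
    "Amy\t\tUK\t3.0\n"
    "Kevin\tid4\tScotland\t\n"
    "Kris\t\t\t3.1\n"
)


def field_density(lines):
    """Return {column name: fraction of data rows with a non-empty value}."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    filled = [0] * len(header)   # non-empty count per column
    rows = 0                     # number of data rows seen
    for row in reader:
        rows += 1
        # Ignore any extra fields beyond the header width.
        for i, value in enumerate(row[:len(header)]):
            if value:
                filled[i] += 1
    return {name: (n / rows if rows else 0.0)
            for name, n in zip(header, filled)}


print(field_density(io.StringIO(SAMPLE)))
# {'Name': 1.0, 'ID': 0.6, 'Country': 0.8, 'GPA': 0.6}
```

Density here is just (number of non-empty values in a column) / (number of data rows), so ID is 3/5 and Country is 4/5.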

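How can I compute this for a file using Python? I also need an efficient, fast algorithm, because I have to do this for thousands of files, more than 40 GB worth. MapReduce code would work too. Thanks in advance :)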

2 answers:

Answer 0: (score: 1)

from collections import Counter
import csv

# `filename` is the path to the tab-separated file.
with open(filename, newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    keys = next(reader)            # header row
    counts = Counter()
    line_count = 0                 # number of data rows
    for row in reader:
        line_count += 1
        # count every column whose value in this row is non-empty
        counts.update(k for k, v in zip(keys, row) if v)
    for k in keys:
        density = counts[k] / line_count if line_count else 0.0
        print(k, 'density:', density)
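Since the question asks about thousands of files and mentions map-reduce, one possible way to scale the same counting idea is to count each file separately (map) and then sum the per-file counts (reduce). This is only a sketch under assumptions: every file is taken to have the same header, and the names `count_file`, `merge` and `density_over_files` are invented here. It uses the standard-library `multiprocessing.Pool`, not a real MapReduce framework.

```python
import csv
from collections import Counter
from multiprocessing import Pool


def count_file(path):
    """Map step: (non-empty count per column, number of data rows) for one file."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader, None)
        if header is None:                 # completely empty file
            return Counter(), 0
        # Start every column at 0 so all-empty columns are still reported.
        filled = Counter({name: 0 for name in header})
        rows = 0
        for row in reader:
            rows += 1
            filled.update(name for name, value in zip(header, row) if value)
    return filled, rows


def merge(results):
    """Reduce step: sum the per-file counts and turn them into densities."""
    total_filled = Counter()
    total_rows = 0
    for filled, rows in results:
        total_filled.update(filled)
        total_rows += rows
    if total_rows == 0:
        return {}
    return {name: n / total_rows for name, n in total_filled.items()}


def density_over_files(paths, workers=4):
    """Count files in parallel worker processes, then merge the results."""
    with Pool(workers) as pool:
        return merge(pool.imap_unordered(count_file, paths))
```

Something like `density_over_files(glob.glob('data/*.tsv'))` would run it, where `data/*.tsv` is a placeholder path; on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. Each file is streamed row by row, so memory use stays small regardless of file size, and the work is most likely bound by disk reads rather than by the counting.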

Answer 1: (score: 0)

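# `name` is the path to the tab-separated file.
with open(name) as f:
    head = f.readline().rstrip('\r\n').split('\t')
    num = 0                     # number of data rows
    has = [0] * len(head)       # non-empty count per column
    for line in f:
        num += 1
        # Strip only the line ending: strip() would also remove leading and
        # trailing tabs and shift the columns of rows with empty edge fields.
        fields = line.rstrip('\r\n').split('\t')
        for i, x in enumerate(fields[:len(head)]):
            if x:
                has[i] += 1

print(head)
print([x / num if num else 0.0 for x in has])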