Hi, suppose I have a tab-delimited file like this (fields are separated by tab characters):
Name    ID    Country    GPA
Tom     id1   USA        3.4
Jon     id2   Canada
Amy           UK         3.0
Kevin   id4   Scotland
Kris                     3.1
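Here the density of Name is 1.0, i.e. 100%; the density of ID is 0.6, i.e. 60% (2 fields are missing); the density of Country is 0.8; and the density of GPA is also 0.6.
How can I compute these densities for a file using Python? I also need an efficient, fast algorithm, because I have to do this for thousands of files totalling more than 40 GB. MapReduce code would work too. Thanks in advance :)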
Answer 0 (score: 1)
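from collections import Counter
import csv

# filename is the path to the tab-delimited file
with open(filename, newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    keys = next(reader)  # the header row gives the column names
    counts = Counter()
    line_count = 0
    for row in reader:
        # count every column whose field is non-empty in this row
        counts.update(k for k, v in zip(keys, row) if v)
        line_count += 1

for k in keys:
    # guard against a file that has a header but no data rows
    print(k, 'density:', counts[k] / line_count if line_count else 0.0)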
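As a quick check against the sample in the question, here is a minimal sketch that wraps the same counting logic in a function and runs it on an in-memory copy of the data. The function name `column_densities` and the use of `io.StringIO` are illustrative choices of mine, and the sample string assumes that missing fields appear as empty strings between tabs.

```python
import csv
import io
from collections import Counter

def column_densities(lines):
    """Return (header, {column: fraction of rows with a non-empty field})."""
    reader = csv.reader(lines, delimiter='\t')
    keys = next(reader)
    counts = Counter()
    rows = 0
    for row in reader:
        # zip stops at the shorter sequence, so short rows count as missing
        counts.update(k for k, v in zip(keys, row) if v)
        rows += 1
    return keys, {k: (counts[k] / rows if rows else 0.0) for k in keys}

# The sample from the question, with empty fields written out explicitly
# (an assumption about how the missing values are stored).
sample = (
    "Name\tID\tCountry\tGPA\n"
    "Tom\tid1\tUSA\t3.4\n"
    "Jon\tid2\tCanada\t\n"
    "Amy\t\tUK\t3.0\n"
    "Kevin\tid4\tScotland\t\n"
    "Kris\t\t\t3.1\n"
)

keys, densities = column_densities(io.StringIO(sample))
for k in keys:
    print(k, 'density:', densities[k])
```

The file is read once, row by row, so memory use stays constant regardless of file size.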
Answer 1 (score: 0)
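# name is the path to the tab-delimited file
with open(name) as f:
    # strip only the line ending: a bare strip() would also remove leading
    # tabs and shift the columns when the first field is empty
    head = f.readline().rstrip('\r\n').split('\t')
    num = 0
    has = [0] * len(head)
    for line in f:
        num += 1
        for i, x in enumerate(line.rstrip('\r\n').split('\t')):
            # ignore any extra fields beyond the header
            if x and i < len(head):
                has[i] += 1

print(head)
print([x / num if num else 0.0 for x in has])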
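For the thousands-of-files part of the question, the per-file counts are independent of each other, so they can be computed in parallel and then summed, in the spirit of MapReduce. The following is only a sketch under assumptions: all files share the same header, a single combined density per column is wanted, and the names `count_file`, `merge` and `densities_for` are hypothetical helpers of mine. It uses the standard library's `multiprocessing.Pool` rather than an actual MapReduce framework.

```python
import csv
from collections import Counter
from multiprocessing import Pool

def count_file(path):
    """Map step: (header, non-empty count per column, row count) for one file."""
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        keys = next(reader, [])  # empty file -> no columns
        counts = Counter()
        rows = 0
        for row in reader:
            counts.update(k for k, v in zip(keys, row) if v)
            rows += 1
    return keys, counts, rows

def merge(results):
    """Reduce step: sum the per-file counts and turn them into densities."""
    total_counts = Counter()
    total_rows = 0
    columns = []
    for keys, counts, rows in results:
        for k in keys:
            if k not in columns:
                columns.append(k)  # keep the header order
        total_counts.update(counts)
        total_rows += rows
    return {k: (total_counts[k] / total_rows if total_rows else 0.0)
            for k in columns}

def densities_for(paths):
    """Count each file in a worker process, then merge the results."""
    with Pool() as pool:
        # imap_unordered hands back each result as soon as a worker finishes
        return merge(pool.imap_unordered(count_file, paths))
```

On platforms that start worker processes by spawning, `densities_for` should be called from under an `if __name__ == '__main__':` guard. If per-file densities are needed instead, each `count_file` result already contains everything required, and the `merge` step can be skipped. Since the work is probably dominated by disk reads at this scale, the speedup from extra processes depends on the storage.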