I have a tab-delimited text file of data in the following format:
Depth Temp Salinity
0.30 28.30 31.90
0.30 28.30 31.90
0.30 28.20 31.90
0.30 28.20 31.90
0.40 28.20 32.00
0.40 28.00 32.00
0.50 28.00 31.90
0.60 28.00 32.00
0.70 27.90 32.00
0.60 27.90 32.10
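What I want to do is find every row whose Depth value is repeated and collect those rows into a list (or lists). From those lists I would then average the values in each column (except the Depth column), sort the results by depth, and write everything back out in the original file format. For the sample file above, the output would be: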
Depth Temp Salinity
0.30 28.25 31.90
0.40 28.10 32.00
0.50 28.00 31.90
0.60 27.95 32.05
0.70 27.90 32.00
I know I need to use .readlines() to get the relevant lines, but how do I grab only the duplicated ones?
Thanks in advance!
Answer 0 (score: 1)
You should use a dictionary whose key is the depth.
lines = [
"0.30 28.30 31.90",
"0.30 28.30 31.90",
"0.30 28.20 31.90",
"0.30 28.20 31.90",
"0.40 28.20 32.00",
"0.40 28.00 32.00",
"0.50 28.00 31.90",
"0.60 28.00 32.00",
"0.70 27.90 32.00",
"0.60 27.90 32.10"
]
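totals = {}  # depth -> (count, sum of temp, sum of salinity)
for line in lines:
    depth, temp, salinity = map(float, line.split())
    count, temp_sum, sal_sum = totals.get(depth, (0, 0.0, 0.0))
    totals[depth] = (count + 1, temp_sum + temp, sal_sum + salinity)
# print the averages in order of increasing depth
for depth in sorted(totals):
    count, temp_sum, sal_sum = totals[depth]
    print("%.2f %.2f %.2f" % (depth, temp_sum / count, sal_sum / count))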
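To tie the dictionary approach back to the file described in the question, here is a minimal sketch rather than a definitive implementation. The helper name `average_by_depth` and the two-decimal, tab-separated output are assumptions chosen to match the sample output. It also assumes the first non-empty line is the header and every other line holds only numeric columns.

```python
from collections import defaultdict

def average_by_depth(lines):
    """Average every non-depth column over rows that share a depth.

    `lines` is any iterable of text lines, header first (hypothetical helper).
    Returns the output lines, sorted by depth.
    """
    lines = [ln for ln in lines if ln.strip()]  # skip blank lines
    header = lines[0].split()
    groups = defaultdict(list)  # depth -> list of rows (remaining columns only)
    for line in lines[1:]:
        depth, *rest = map(float, line.split())
        groups[depth].append(rest)

    out = ["\t".join(header)]
    for depth in sorted(groups):
        rows = groups[depth]
        # zip(*rows) turns the list of rows into a sequence of columns
        means = [sum(col) / len(rows) for col in zip(*rows)]
        out.append("\t".join("%.2f" % v for v in [depth] + means))
    return out
```

An open file object can be passed directly as `lines`, and the returned list can be joined with newlines and written back to a file. Depths that appear only once pass through unchanged, since the average of a single row is the row itself.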
Answer 1 (score: 1)
If you are able to read the whole file into memory, itertools.groupby may simplify your code:
from itertools import groupby

with open('file.txt') as f:
    header = f.readline().rstrip("\n")  # keep the header line as text
    # parse the remaining non-empty lines into lists of floats
    rows = [[float(v) for v in line.split()] for line in f if line.strip()]

print(header)

def key_fun(row):
    return row[0]  # group on the Depth column

# groupby only merges adjacent items, so sort by depth first
for k, g in groupby(sorted(rows, key=key_fun), key=key_fun):
    g = list(g)
    mean_temp = sum(x[1] for x in g) / len(g)
    mean_salinity = sum(x[2] for x in g) / len(g)
    print("%.2f\t%.2f\t%.2f" % (k, mean_temp, mean_salinity))
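The question also asks how to pick out only the rows whose depth repeats. One possible sketch with the same groupby idea is to keep just the groups that contain more than one row. The helper name `duplicated_groups` is hypothetical, and the rows are assumed to be already parsed into lists of floats.

```python
from itertools import groupby

def duplicated_groups(rows):
    """Return {depth: [rows...]} for depths that occur more than once."""
    def key(row):
        return row[0]  # the Depth column

    result = {}
    # groupby only merges adjacent items, so the rows are sorted by depth first
    for depth, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        if len(group) > 1:
            result[depth] = group
    return result

rows = [[0.3, 28.3, 31.9], [0.3, 28.2, 31.9], [0.5, 28.0, 31.9],
        [0.6, 28.0, 32.0], [0.6, 27.9, 32.1]]
dups = duplicated_groups(rows)
print(sorted(dups))  # [0.3, 0.6]
```

For the output requested in the question this filter is not strictly needed, because depths that occur only once still appear in the result and averaging a single row leaves it unchanged.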
Answer 2 (score: 0)
Using numpy can simplify the computation:
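import numpy as np

with open("data.txt") as f:
    titles = f.readline().split()  # header names
    data = np.loadtxt(f)           # remaining rows as a 2-D float array

data = data[np.argsort(data[:, 0])]  # sort rows by depth
# indices where the depth changes, i.e. where a new group starts
split_index = np.where(np.diff(data[:, 0]) > 0)[0] + 1
print("\t".join(titles))
for a in np.split(data, split_index):
    # averaging the depth column too is harmless: it is constant within a group
    print("\t".join("%.2f" % x for x in np.average(a, axis=0)))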