Creating a fingerprint file from a flat file in Python

Date: 2013-07-04 12:42:16

Tags: python

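I have another newbie Python question. I have a file like the one shown below, and I need to convert it into vector and fingerprint form. My problem is how to combine the file so that I end up with a matrix in which the rows are the cmps and the columns are the vals; if a cmp has no value for a given val, the entry should be zero. The vals differ from cmp to cmp, and the overlap is not very large. Could you tell me where best to start? Python dictionaries? Any ideas would help. Thanks!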

cmp1    0.277   val_1
cmp1    0.097   val_2
cmp1    0.795   val_3
cmp1    0.809   val_4
cmp1    0.127   val_5
cmp2    0.839   val_3
cmp2    0.909   val_4
cmp2    0.148   val_5
cmp2    0.938   val_6
cmp2    0.599   val_7

The result I would like to get:

Vector version

name    val_1   val_2   val_3   val_4   val_5   val_6   val_7
cmp1    0.277   0.097   0.795   0.809   0.127   0   0
cmp2    0   0   0.839   0.909   0.148   0.938   0.599   

Binary version

name    val_1   val_2   val_3   val_4   val_5   val_6   val_7
cmp1    0   0   1   1   0   0   0
cmp2    0   0   1   1   0   1   1

Current code

import csv

fi = open("data.txt", "rb")
fo = open("data_out.txt", "wb")
reader = csv.reader(fi,delimiter='\t')
writer = csv.writer(fo,delimiter='\t')

# making unique lists
targets = set()
ligands = set()

for row in reader:
    ligands.add(row[0])
    targets.add(row[2])

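data = []
# NOTE: reader was exhausted by the loop above; fi.seek(0) is needed before a second pass
for row in reader:
    if row[0] in ligands and row[2] in targets:
        pass  # (unfinished)
    else:
        pass  # (unfinished)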

1 answer:

Answer 0 (score: 2)

You can use collections.defaultdict here:

from collections import defaultdict
with open('abc') as f:
    dic = defaultdict(dict)  # maps each cmp name to its {val_i: value} dict
    for line in f:
        name, val, col = line.split()  # e.g. 'cmp1', '0.277', 'val_1'
        dic[name][col] = val
print(dic)
# defaultdict(<type 'dict'>,
#  {'cmp1': {'val_5': '0.127', 'val_4': '0.809', 'val_1': '0.277', 'val_3': '0.795', 'val_2': '0.097'},
#   'cmp2': {'val_5': '0.148', 'val_4': '0.909', 'val_7': '0.599', 'val_6': '0.938', 'val_3': '0.839'}})

# get a sorted list of all val_i names from dic
vals = sorted(set(y for x in dic.values() for y in x))

keys = sorted(dic)
print("name    {}".format("\t".join(vals)))
for key in keys:
    # a val_i missing for this cmp is written as '0'
    print("{}    {}".format(key, "\t".join(dic[key].get(v, '0') for v in vals)))

Output:

name    val_1   val_2   val_3   val_4   val_5   val_6   val_7
cmp1    0.277   0.097   0.795   0.809   0.127   0   0
cmp2    0   0   0.839   0.909   0.148   0.938   0.599
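
For reference, the same approach can also be written as a small Python 3 function that writes the matrix with the csv module, as the question's code does. This is only a sketch: the names `read_triples` and `write_matrix` and the shortened in-memory sample are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
from collections import defaultdict


def read_triples(lines):
    """Collect {row_name: {column_name: value}} from 'name value column' lines."""
    table = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:  # skip blank or malformed lines
            continue
        name, val, col = parts
        table[name][col] = val
    return table


def write_matrix(table, out):
    """Write one row per name and one column per val_i; missing cells become '0'."""
    cols = sorted({col for row in table.values() for col in row})
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["name"] + cols)
    for name in sorted(table):
        writer.writerow([name] + [table[name].get(col, "0") for col in cols])


# Shortened sample in the same layout as the question's file (illustrative only).
sample = """\
cmp1    0.277   val_1
cmp1    0.795   val_3
cmp2    0.839   val_3
cmp2    0.938   val_6
"""

buf = io.StringIO()
write_matrix(read_triples(sample.splitlines()), buf)
result = buf.getvalue()
print(result)
```

Keep in mind that sorted() orders the column names as strings, so a name such as val_10 would sort before val_2.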

For the binary version, you can try:

# round each value to 0 or 1; a val_i missing for this cmp stays '0'
print("name    {}".format("\t".join(vals)))
for key in keys:
    strs = "\t".join(str(int(round(float(dic[key][v])))) if v in dic[key] else '0' for v in vals)
    print("{}    {}".format(key, strs))

Output:

name    val_1   val_2   val_3   val_4   val_5   val_6   val_7
cmp1    0   0   1   1   0   0   0
cmp2    0   0   1   1   0   1   1
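
One caveat about round(): at exactly 0.5 it rounds away from zero in Python 2 but to the nearest even integer in Python 3, so an explicit cutoff may be easier to reason about. The sketch below assumes a cutoff of 0.5, which reproduces the expected output from the question; the function name `to_bits` and its `threshold` parameter are made up for this example.

```python
def to_bits(row, cols, threshold=0.5):
    """Return '1' for every column whose value is >= threshold, else '0'.

    Columns missing from `row` count as 0. The 0.5 default is an assumption
    chosen to match the rounding used above.
    """
    return ["1" if float(row.get(col, 0)) >= threshold else "0" for col in cols]


cols = ["val_1", "val_2", "val_3", "val_4", "val_5", "val_6", "val_7"]
cmp1 = {"val_1": "0.277", "val_2": "0.097", "val_3": "0.795",
        "val_4": "0.809", "val_5": "0.127"}
cmp2 = {"val_3": "0.839", "val_4": "0.909", "val_5": "0.148",
        "val_6": "0.938", "val_7": "0.599"}

bits1 = to_bits(cmp1, cols)
bits2 = to_bits(cmp2, cols)
print("\t".join(["cmp1"] + bits1))
print("\t".join(["cmp2"] + bits2))
```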