我在gzip中有一个大矩阵,看起来像这样:
locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0,0.5536,0.9177,0.2929,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0,0.9536,0.8177,0.2827,0.0,0.0
因此,每行以两个描述符开头,后跟10个值。
我只是想解析这一行的前5个值,这样我就有了这样的矩阵:
locus_1 mark1 0.0,0.0,0.0,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0
我已经制作了以下python脚本来解析它,但无济于事:
import gzip
import numpy as np
inFile = gzip.open('/home/anish/data.gz')
inFile.next()
for line in inFile:
cols = line.strip().replace('nan','0').split('\t')
data = cols[2:]
data = map(float,data)
gfpVals = data[:5]
print '\t'.join(cols[:6]) + '\t' + '\t'.join(map(str,gfpVals))
我只是得到错误:
data = map(float,data)
ValueError: could not convert string to float:
答案 0 :(得分:2)
您只使用制表符作为分隔符,而值也以逗号分隔。
结果
locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
分为
locus_1 || mark1 || 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
你正在传递浮动字符串
"0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0"
这是一个无效的文字。
你应该替换:
data = cols[2:]
与
data = cols[2:].split(',')