Parsing a large dataset with Python

Time: 2017-03-10 18:33:34

Tags: python parsing

I have a large matrix in a gzip file that looks like this:

locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0,0.5536,0.9177,0.2929,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0,0.9536,0.8177,0.2827,0.0,0.0

So each row starts with two descriptors, followed by 10 values.

I just want to keep the first 5 values of each row, so that I end up with a matrix like this:

locus_1 mark1 0.0,0.0,0.0,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0

I wrote the following Python script to parse it, but it does not work:

import gzip
import numpy as np

inFile = gzip.open('/home/anish/data.gz')

inFile.next()

for line in inFile:
        cols = line.strip().replace('nan','0').split('\t')
        data = cols[2:]
        data = map(float,data)

        gfpVals =  data[:5]

        print '\t'.join(cols[:6]) + '\t' + '\t'.join(map(str,gfpVals))

I just get this error:

data = map(float,data)
ValueError: could not convert string to float: 

1 answer:

Answer 0: (score: 2)

You are splitting on tabs only, but the values themselves are separated by commas.

As a result, the row

locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0

is split into

locus_1 || mark1 || 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0

so you are passing float() the string

"0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0"

which is not a valid float literal.
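The failure can be reproduced on a single row. This is only a small sketch: it assumes Python 3 and that the three columns are tab-separated, as the split shown above suggests; the variable names are illustrative.

```python
# Sample row from the question; the tabs between columns are an assumption.
line = "locus_1\tmark1\t0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0\n"

cols = line.strip().split("\t")
# cols now holds three items; the third is still one comma-joined string.

try:
    float(cols[2])
    failed = False
except ValueError:
    failed = True  # float() cannot parse a comma-separated string

# Splitting the third field on commas yields strings that float() accepts.
values = [float(v) for v in cols[2].split(",")]
```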

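You should replace:

 data = cols[2:]

with:

 data = cols[2].split(',')

Note the index: cols[2:] is a list, which has no split() method, so split the third column itself.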
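Putting the fix into the whole script, one possible version looks like the sketch below. It assumes Python 3 (the question's script is Python 2), tab-separated columns, and a first line that should be skipped, as the original inFile.next() call implies. The names first_values and trim_matrix are made up for this example, and numpy is not needed for this step.

```python
import gzip


def first_values(line, n=5):
    """Return the two descriptors and the first n values of one row."""
    cols = line.strip().split("\t")
    # Treat 'nan' as 0, as the question's script does, then convert to float.
    values = [0.0 if v == "nan" else float(v) for v in cols[2].split(",")]
    return cols[0], cols[1], values[:n]


def trim_matrix(path, n=5, skip_header=True):
    """Yield trimmed rows from a gzip-compressed matrix, one line at a time."""
    with gzip.open(path, "rt") as in_file:  # "rt" yields str instead of bytes
        if skip_header:
            next(in_file, None)  # mirrors inFile.next() in the question
        for line in in_file:
            if not line.strip():
                continue  # ignore blank lines
            locus, mark, values = first_values(line, n)
            yield "\t".join([locus, mark, ",".join(str(v) for v in values)])
```

Looping over trim_matrix('/home/anish/data.gz') and printing each row should give the trimmed matrix. Because the file is read line by line, the whole matrix never has to fit in memory.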