Reading data from a non-CSV file

Time: 2015-07-22 17:38:48

Tags: python csv file-io

The data in my text file looks like this:

2,20 12,40 13,100 14,300
15,440 16,10 24,50 25,350
26,2322 27,3323 28,9999 29,2152
30,2622 31,50

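I want to read this data into two different lists in Python. However, this is not a CSV file. The data is laid out as follows: mass1,intensity1 mass2,intensity2 mass3,intensity3...

How should I read the masses and intensities into two different lists? I am trying to avoid rewriting this file to make the data tidier and/or convert it to CSV format.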

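4 answers:

Answer 0 (score: 5)

It looks like you can call line.split() on each line to separate the individual pairs, then use pair.split(",") to separate the mass and intensity within each pair.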
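
A minimal sketch of that idea, run on an in-memory sample rather than the real file. The function name parse_pairs and the sample string are made up for illustration, and the values are assumed to be integers as in the question's data:

```python
def parse_pairs(text):
    """Split whitespace-separated 'mass,intensity' pairs into two lists."""
    masses, intensities = [], []
    for line in text.splitlines():
        for pair in line.split():              # split on any run of whitespace
            mass, intensity = pair.split(",")  # "12,40" -> "12", "40"
            masses.append(int(mass))
            intensities.append(int(intensity))
    return masses, intensities

# Hypothetical sample: the first two lines of the data in the question
sample = "2,20 12,40 13,100 14,300\n15,440 16,10 24,50 25,350\n"
masses, intensities = parse_pairs(sample)
print(masses)
print(intensities)
```

Calling split() with no argument also tolerates repeated spaces and blank lines, which split(' ') does not.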

Answer 1 (score: 1)

mass_results = []
intensity_results = []

with open('in.txt', 'r') as f:
    for line in f:
        for readings in line.split():  # split on any whitespace, so blank lines are skipped
            mass, intensity = readings.split(',')
            mass_results.append(int(mass.strip()))
            intensity_results.append(int(intensity.strip()))

print('Mass values:')
print(mass_results)
print('Intensity values:')
print(intensity_results)

Output:

Mass values:
[2, 12, 13, 14, 15, 16, 24, 25, 26, 27, 28, 29, 30, 31]
Intensity values:
[20, 40, 100, 300, 440, 10, 50, 350, 2322, 3323, 9999, 2152, 2622, 50]

Answer 2 (score: 0)

import re

# read the whole file into one string
with open('input.dat', 'r') as f:
    data = f.read()

# grab mass and intensity values using regex
m_re = r'[0-9]+(?=,[0-9]+)'  # digits followed by ",digits"
i_re = r'(?<=[0-9],)[0-9]+'  # digits preceded by "digit,"
mass = re.findall(m_re, data)
intensity = re.findall(i_re, data)

# view results (the values are strings; convert with int() if numbers are needed)
print("Mass values:", mass)
print("Intensity values:", intensity)
print("(Mass,Intensity):", list(zip(mass, intensity)))

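If the 25-line header you mentioned matches the regex and distorts the results, you can try replacing the file-input part above with the following: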

# read the file
with open('input.dat', 'r') as f:
    lines = f.readlines()[25:]  # ignore first 25 lines
data = ' '.join(lines)
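
As a variation on the regex approach, a single pattern with two capture groups can return each mass and intensity together, and itertools.islice can skip header lines without loading the whole file. This is only a sketch: the names PAIR_RE, parse_lines, and sample_lines are invented for illustration, and integer values are assumed.

```python
import re
from itertools import islice

# One pattern with two capture groups: (mass),(intensity)
PAIR_RE = re.compile(r"(\d+),(\d+)")

def parse_lines(lines, header_len=0):
    """Collect masses and intensities from an iterable of lines,
    skipping the first header_len lines."""
    masses, intensities = [], []
    for line in islice(lines, header_len, None):
        for mass, intensity in PAIR_RE.findall(line):
            masses.append(int(mass))
            intensities.append(int(intensity))
    return masses, intensities

# Hypothetical sample: one header line that happens to contain a "digits,digits" match
sample_lines = ["header 1,2 should be skipped\n", "2,20 12,40\n", "13,100 14,300\n"]
masses, intensities = parse_lines(sample_lines, header_len=1)
print(masses)
print(intensities)
```

Because an open file object iterates over its lines, the same function could be called as parse_lines(f, header_len=25) inside a with open(...) block.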

Answer 3 (score: 0)

Assuming the input file looks like

#this is header
#this is header
#this is header
2,20 12,40 13,100 14,300
15,440 16,10 24,50 25,350
26,2322 27,3323 28,9999 29,2152
30,2622 31,50

You can use re

Method 1

If the file is very large

import re

def xy_parser(fname, header_len=3):
    """Yield the list of 'x,y' strings found on each non-header line."""
    with open(fname) as f:
        for i, line in enumerate(f):
            if i < header_len:
                continue  # skip header lines
            yield re.findall(r'[0-9]+,[0-9]+', line)

def xy_maker(xy_str):
    # list() is needed on Python 3, where map() returns an iterator
    return list(map(float, xy_str.split(',')))

my_xys = []
for xys in xy_parser('xydata.txt'):
    my_xys += [xy_maker(val) for val in xys]
print(my_xys)
#[[2.0, 20.0],
# [12.0, 40.0],
# [13.0, 100.0],
# [14.0, 300.0],
# [15.0, 440.0],
# [16.0, 10.0],
# [24.0, 50.0],
# [25.0, 350.0],
# [26.0, 2322.0],
# [27.0, 3323.0],
# [28.0, 9999.0],
# [29.0, 2152.0],
# [30.0, 2622.0],
# [31.0, 50.0]]
Method 2

I would also like to point out that if the file is not too large, you can read it all at once

# reuses re and xy_maker from Method 1
header_len = 3
with open('xydata.txt', 'r') as f:
    for i in range(header_len):  # skip the header lines
        f.readline()
    # read from the current file position to the end of the file; newlines become
    # spaces so the last pair on one line is not fused with the first pair on the next
    data_str = f.read().replace('\n', ' ')

data_xy_str = re.findall(r'[0-9]+,[0-9]+', data_str)
my_xys      = [xy_maker(xy_str) for xy_str in data_xy_str]
# yields the same result as above
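
Both methods produce a list of [mass, intensity] pairs, whereas the question asks for two separate lists. One way to split them is to transpose the pairs with zip, sketched here on a few hard-coded sample pairs. The sketch assumes my_xys is non-empty, since unpacking an empty zip would raise a ValueError.

```python
# Hypothetical sample: the first three pairs from the result above
my_xys = [[2.0, 20.0], [12.0, 40.0], [13.0, 100.0]]

# zip(*my_xys) groups the first elements together and the second elements together
masses, intensities = (list(column) for column in zip(*my_xys))

print(masses)
print(intensities)
```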