Reading arrays of numbers from a text file with no delimiters

Asked: 2011-05-27 09:37:27

Tags: python

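I am trying to read some numerical data from a text file, but I am struggling with numbers that are stored without any delimiter. The file format itself is a fairly standard one used by many codes around the world, so it cannot be changed. Here is a snippet from the top of a sample file: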

SOME TEXT OF A FIXED LENGTH      33
 3.192839854E+00 3.189751983E+00 3.186795271E+00 3.183874776E+00 3.180986976E+00
 3.178133610E+00 3.175318116E+00 3.172544681E+00 3.169818171E+00 3.167143271E+00
 3.164524875E+00 3.161968464E+00 3.159479193E+00 3.157062171E+00 3.154723040E+00
 3.152466964E+00 3.150299067E+00 3.148224863E+00 3.146249721E+00 3.144379226E+00
 3.142619004E+00 3.140974218E+00 3.139450283E+00 3.138052814E+00 3.136786929E+00
 3.135657986E+00 3.134671499E+00 3.133833067E+00 3.133149899E+00 3.132631559E+00
 3.132282773E+00 3.132080343E+00 3.131954939E+00
-5.487648393E-01-5.476736110E-01-5.447693831E-01-5.405765060E-01-5.353610408E-01
-5.291415409E-01-5.219573970E-01-5.137449740E-01-5.045337620E-01-4.943949468E-01
-4.832213992E-01-4.710109577E-01-4.578747780E-01-4.436967869E-01-4.285062978E-01
-4.123986122E-01-3.952894227E-01-3.771859951E-01-3.580934057E-01-3.379503384E-01
-3.168282028E-01-2.947799605E-01-2.716835737E-01-2.476267515E-01-2.226373818E-01
-1.966313850E-01-1.696421504E-01-1.415353640E-01-1.118510940E-01-8.041086734E-02
-4.968321601E-02-2.772555484E-02-2.631111359E-02
....

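The first line contains some comment text (of fixed length) followed by an integer that gives the length of the arrays that follow. The arrays themselves are stored as a list of fixed-width numbers. In this case the first array should not give me any trouble. However, as you can see from the second array, all of its numbers are negative, so there are no spaces between them. As a result, methods such as str.split() will not return a list of numbers. I would be grateful for any suggestions on how best to handle this file.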
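To make the problem concrete, here is a minimal sketch using two shortened lines from the sample above (the variable names are only for illustration). The line of positive values splits into three tokens, while the line of negative values comes back as a single token that float() cannot convert:

```python
spaced = " 3.192839854E+00 3.189751983E+00 3.186795271E+00"
packed = "-5.487648393E-01-5.476736110E-01-5.447693831E-01"

# Positive values carry a leading blank, so whitespace splitting works.
print(len(spaced.split()))

# Negative values fill the whole field, so the line comes back as one token.
tokens = packed.split()
print(len(tokens))

try:
    float(tokens[0])
except ValueError as exc:
    print("cannot convert:", exc)
```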

One last piece of information that may be important: the arrays themselves contain newlines, i.e. the following code

with open('some_file') as fh:
    data = [line for line in fh]

npts = int(data.pop(0).split()[-1])
print(data)

returns:

[' 3.192839854E+00 3.189751983E+00 3.186795271E+00 3.183874776E+00 3.180986976E+00\n',
 ' 3.178133610E+00 3.175318116E+00 3.172544681E+00 3.169818171E+00 3.167143271E+00\n',
 ' 3.164524875E+00 3.161968464E+00 3.159479193E+00 3.157062171E+00 3.154723040E+00\n',
 ' 3.152466964E+00 3.150299067E+00 3.148224863E+00 3.146249721E+00 3.144379226E+00\n',
 ' 3.142619004E+00 3.140974218E+00 3.139450283E+00 3.138052814E+00 3.136786929E+00\n',
 ' 3.135657986E+00 3.134671499E+00 3.133833067E+00 3.133149899E+00 3.132631559E+00\n',
 ' 3.132282773E+00 3.132080343E+00 3.131954939E+00\n', 
 '-5.487648393E-01-5.476736110E-01-5.447693831E-01-5.405765060E-01-5.353610408E-01\n',
 '-5.291415409E-01-5.219573970E-01-5.137449740E-01-5.045337620E-01-4.943949468E-01\n',
 '-4.832213992E-01-4.710109577E-01-4.578747780E-01-4.436967869E-01-4.285062978E-01\n',
 '-4.123986122E-01-3.952894227E-01-3.771859951E-01-3.580934057E-01-3.379503384E-01\n',
 '-3.168282028E-01-2.947799605E-01-2.716835737E-01-2.476267515E-01-2.226373818E-01\n',
 '-1.966313850E-01-1.696421504E-01-1.415353640E-01-1.118510940E-01-8.041086734E-02\n',
 '-4.968321601E-02-2.772555484E-02-2.631111359E-02\n', ... ]

Hopefully this is relatively clear; let me know if you need more information about the file format.

4 Answers:

Answer 0 (score: 3)

  

"The arrays themselves are stored as a list of fixed-width numbers."

Since each entry is exactly sixteen characters wide, the following turns one line of the input file into a list of floats:

In [1]: line = '-5.487648393E-01-5.476736110E-01-5.447693831E-01-5.405765060E-01-5.353610408E-01'

In [2]: [float(line[i:i+16]) for i in xrange(0, len(line), 16)]
Out[2]: 
[-0.54876483929999997,
 -0.547673611,
 -0.5447693831,
 -0.54057650599999996,
 -0.53536104080000002]

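Here I assume that the line has no trailing newline; if it might have one, strip it first with str.rstrip. The following snippet also shows how to split the sequence of numbers into chunks of n (note that it does not attempt to parse the header line, so that line has to be skipped or read separately first):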

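n = 33  # array length, taken from the header line
arr = []
with open('data.txt') as fh:
    for line in fh:
        line = line.rstrip('\n')
        # slice the line into 16-character fields and convert each one
        arr.extend(float(line[i:i+16]) for i in range(0, len(line), 16))
        if len(arr) >= n:
            # a full array has been collected; keep any surplus for the next one
            print(arr[:n])
            arr = arr[n:]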
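For completeness, here is a hedged sketch that also reads the header. It assumes, based only on the sample in the question, that the array length is the last whitespace-separated token of the header, that every field is 16 characters wide, and that each array starts on a new line. The names read_header, read_array and FIELD_WIDTH are made up for this example.

```python
import io

FIELD_WIDTH = 16  # assumption: every number occupies exactly 16 characters


def read_header(fh):
    """Return the array length stored as the last token of the header line."""
    return int(fh.readline().split()[-1])


def read_array(fh, npts, width=FIELD_WIDTH):
    """Read lines from fh until npts fixed-width numbers have been collected."""
    values = []
    while len(values) < npts:
        line = fh.readline()
        if not line:
            raise ValueError("file ended after %d of %d values" % (len(values), npts))
        line = line.rstrip()  # drop the newline and any trailing blanks
        values.extend(float(line[i:i + width]) for i in range(0, len(line), width))
    return values


# Hypothetical shortened input: a header announcing 3 values, then two arrays.
sample = io.StringIO(
    "SOME TEXT OF A FIXED LENGTH       3\n"
    " 3.192839854E+00 3.189751983E+00 3.186795271E+00\n"
    "-5.487648393E-01-5.476736110E-01-5.447693831E-01\n"
)
npts = read_header(sample)
first = read_array(sample, npts)
second = read_array(sample, npts)
print(first)
print(second)
```

With a real file, the io.StringIO object would be replaced by the handle returned from open(); read_array can then be called once per array.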

Answer 1 (score: 1)

Chris, in this case you should use f.read(size) to read the numbers one after another.

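This should give you an idea. Also, please post an original sample file somewhere on the web so that we can test against it; copying and pasting into a wiki may corrupt the formatting.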

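def split_len(seq, length):
    # cut seq into consecutive pieces of the given length
    return [seq[i:i+length] for i in range(0, len(seq), length)]

f = open("sample.txt")

header = f.readline()
# the array length is the last space-separated token of the header
(a,b,size) = header.rstrip().rpartition(' ')
size = int(size)
lines = f.readlines()
f.close()
found = 0
for line in lines:
    for number in split_len(line.rstrip(), 16):
        found = found + 1
        print(number)
        if found==size:
            break
    if found==size:
        break  # the inner break alone does not leave the outer loop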
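The f.read(size) idea mentioned above can be sketched more directly as follows. This is only an illustration: it assumes every field is exactly 16 characters wide and that nothing but line breaks sits between fields, and the helper name read_fields is made up here.

```python
import io


def read_fields(fh, count, width=16):
    """Read `count` fixed-width numbers from fh using fh.read()."""
    values = []
    while len(values) < count:
        field = ""
        while len(field) < width:
            chunk = fh.read(width - len(field))
            if not chunk:
                raise ValueError("unexpected end of file")
            # Line breaks are not part of any field, so discard them.
            field += chunk.replace("\n", "").replace("\r", "")
        values.append(float(field))
    return values


# Hypothetical shortened input; a real file handle from open() works the same way.
sample = io.StringIO(
    "SOME TEXT OF A FIXED LENGTH       3\n"
    "-5.487648393E-01-5.476736110E-01\n"
    "-5.447693831E-01\n"
)
size = int(sample.readline().split()[-1])
print(read_fields(sample, size))
```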

Answer 2 (score: 0)

Some pseudocode:

Loop through the line-as-string one character at a time.
  |-> A. If you hit a space or hyphen character, treat either as a delimiter
  |      (unless the hyphen directly follows 'E', where it is the exponent sign).
  |---> Add your buffered string, if any, to an array of numbers.
  |---> Reset buffer.
  |-> B. Add the character to the buffer unless it is a space (a hyphen stays, as the sign).
  |-> C. Repeat A. and B. until you hit a newline character, then flush the buffer.
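The loop above can be sketched in Python roughly as follows. One detail is an assumption added for this sketch: a hyphen that directly follows 'E' is treated as the exponent sign and not as a delimiter, since otherwise values such as -5.487648393E-01 would be cut in two. The function name scan_numbers is made up for illustration, and a leading '+' sign is not handled.

```python
def scan_numbers(line):
    """Split one line into floats by scanning it character by character."""
    numbers = []
    buf = ""
    for ch in line.rstrip("\r\n"):
        # A hyphen right after E/e belongs to the exponent of the current number.
        exponent_sign = ch == "-" and buf[-1:] in ("E", "e")
        if (ch == " " or ch == "-") and not exponent_sign:
            # Delimiter reached: flush whatever has been buffered so far.
            if buf:
                numbers.append(float(buf))
            buf = ""
        if ch != " ":
            buf += ch  # a delimiting hyphen is kept as the sign of the next number
    if buf:
        numbers.append(float(buf))
    return numbers


print(scan_numbers("-5.487648393E-01-5.476736110E-01\n"))
print(scan_numbers(" 3.192839854E+00 3.189751983E+00\n"))
```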

Answer 3 (score: 0)

How about using a regular expression? The following should definitely work:

>>> import re
>>> ...
>>> data = ' '.join([e[:-1] for e in data])
>>> numbers = re.findall(r'[ \-]\d+\.\d+E[+\-]\d+',data)
>>> numbers
[' 3.192839854E+00', ' 3.189751983E+00', ' 3.186795271E+00', ' 3.183874776E+00', ' ...  
>>> map(float,numbers)
[3.1928398539999998, 3.1897519829999998, 3.1867952709999998, 3.1838747760000001, ...
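As a self-contained variant of the same idea, here is a minimal sketch. The pattern makes the sign optional instead of requiring a leading space or hyphen, so the lines do not need to be joined with a space first. It assumes every number has a decimal point and a signed exponent, as in the sample; the names NUMBER and find_numbers are made up for this example.

```python
import re

# Optional sign, digits, a decimal point, digits, then a signed exponent.
NUMBER = re.compile(r"[+-]?\d+\.\d+[Ee][+-]\d+")


def find_numbers(text):
    """Return every number in text that matches the pattern, as floats."""
    return [float(token) for token in NUMBER.findall(text)]


# Hypothetical shortened lines in the same layout as the question's data.
lines = [
    " 3.192839854E+00 3.189751983E+00\n",
    "-5.487648393E-01-5.476736110E-01\n",
]
print(find_numbers("".join(lines)))
```

Because the pattern requires a decimal point and an exponent, a bare integer such as the 33 in the header is not picked up.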