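Loading inconsistently formatted text into an array in Python

Time: 2013-09-10 14:15:09

Tags: python arrays text numpy

I need to read a file and import it into Python. The problem is that the file's formatting is inconsistent. This is what the file contains: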

             -0.2066687680781E-01 0.4329528510571E+00-0.9011796712875E+00
             -0.4119676724076E-01 0.4006276726723E+00-0.9153143167496E+00
              0.1022378727794E+00 0.2991854846478E+00-0.9487020373344E+00
              0.2066854201257E-01 0.3005275726318E+00-0.9535492062569E+00
              0.4130198806524E-01 0.3341401219368E+00-0.9416180849075E+00
              0.6145291402936E-01 0.3000802397728E+00-0.9519324898720E+00
              0.8211978524923E-01 0.3335199654102E+00-0.9391596317291E+00
              0.6186530366540E-01 0.3671853244305E+00-0.9280881881714E+00
             -0.2066862955689E-01 0.3678680062294E+00-0.9296482801437E+00
              0.2066862955689E-01 0.3678680062294E+00-0.9296482801437E+00
              0.0000000000000E+00 0.3344254791737E+00-0.9424222111702E+00
              0.5163235664368E+00-0.3289847448468E-01-0.8557614684105E+00
              0.5062980055809E+00-0.6575757265091E-01-0.8598478436470E+00
              0.4863796830177E+00-0.3290597721934E-01-0.8731277585030E+00
              0.4844416379929E+00-0.1312004029751E+00-0.8649293184280E+00
              0.4652865529060E+00-0.9858986735344E-01-0.8796525001526E+00
              0.4453650414944E+00-0.6581693142653E-01-0.8929267525673E+00
              0.4761176705360E+00-0.6582681834698E-01-0.8769143819809E+00    

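Most of the time the numbers are separated into three columns, but when a number is negative there is no space in front of it, which causes an error when the file is loaded into Python. This is what I use to load the file: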

from numpy import *
import numpy as np
sphere = np.loadtxt("sphererad1.out")

This is the error I get:

File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/lib/npyio.py", line 827, in loadtxt
items = [conv(val) for (conv, val) in zip(converters, vals)]
ValueError: invalid literal for float(): 0.2899294197559E+00-0.1325698643923E+00

I cannot regenerate the data, so I have to figure out how to get it into Python as it is. I tried reading it in like this:

opn = open("sphererad1.out")
sphere = opn.readlines()
opn.close() 

To test splitting a line into its individual numbers, I tried this:

l = sphere[2000]
n = 18
[l[i:i+n] for i in range(0, len(l), n)]

and I got:

['             -0.24', '73256886005E+00-0.', '6656686961651E-01-', '0.9666430950165E+0', '0\n']

If the first number is negative there are 13 spaces on the left; if it is positive there are 14.

n = 1
[l[i:i+n] for i in range(0, len(l), n)]
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '-', '0', '.', '2', '4', '7', '3', '2', '5', '6', '8', '8', '6', '0', '0', '5', 'E', '+', '0', '0', '-', '0', '.', '6', '6', '5', '6', '6', '8', '6', '9', '6', '1', '6', '5', '1', 'E', '-', '0', '1', '-', '0', '.', '9', '6', '6', '6', '4', '3', '0', '9', '5', '0', '1', '6', '5', 'E', '+', '0', '0', '\n']

How can I make it ignore the first block of spaces, then split the line into three columns of numbers and build an array?

5 answers:

Answer 0 (score: 1)

Use a regular expression:

import re
with open("sphererad1.out") as f:
    for line in f:
        # findall returns every number on the line, even when two are fused
        print([float(x) for x in re.findall(r'-?\d+\.\d*[eE][+-]\d+', line)])

[-0.02066687680781, 0.4329528510571, -0.9011796712875]
[-0.04119676724076, 0.4006276726723, -0.9153143167496]
[0.1022378727794, 0.2991854846478, -0.9487020373344]
[0.02066854201257, 0.3005275726318, -0.9535492062569]
[0.04130198806524, 0.3341401219368, -0.9416180849075]
[0.06145291402936, 0.3000802397728, -0.951932489872]
[0.08211978524923, 0.3335199654102, -0.9391596317291]
[0.0618653036654, 0.3671853244305, -0.9280881881714]
[-0.02066862955689, 0.3678680062294, -0.9296482801437]
[0.02066862955689, 0.3678680062294, -0.9296482801437]
[0.0, 0.3344254791737, -0.9424222111702]
[0.5163235664368, -0.03289847448468, -0.8557614684105]
[0.5062980055809, -0.06575757265091, -0.859847843647]
[0.4863796830177, -0.03290597721934, -0.873127758503]
[0.4844416379929, -0.1312004029751, -0.864929318428]
[0.465286552906, -0.09858986735344, -0.8796525001526]
[0.4453650414944, -0.06581693142653, -0.8929267525673]
[0.476117670536, -0.06582681834698, -0.8769143819809]
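
To end up with rows of floats rather than printed lists, the same pattern can be wrapped in a small function. This is only a sketch: the name `parse_rows` and the check that each line yields exactly three numbers are assumptions added for illustration, not part of the answer.

```python
import re

# Same number pattern as in the answer: optional minus, mantissa, exponent.
NUMBER = re.compile(r'-?\d+\.\d*[eE][+-]\d+')

def parse_rows(lines, expected=3):
    """Return a list of float rows; blank lines are skipped (hypothetical helper)."""
    rows = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        values = [float(tok) for tok in NUMBER.findall(line)]
        if len(values) != expected:
            raise ValueError("line %d: expected %d numbers, got %d"
                             % (lineno, expected, len(values)))
        rows.append(values)
    return rows

# Two lines from the question, including one with fused negative numbers.
sample = [
    " " * 13 + "-0.2066687680781E-01 0.4329528510571E+00-0.9011796712875E+00\n",
    " " * 14 + "0.5163235664368E+00-0.3289847448468E-01-0.8557614684105E+00\n",
]
rows = parse_rows(sample)
```

With the real file, passing the open file object to `parse_rows` and then calling `np.asarray(rows)` should give an N x 3 array, assuming every non-blank line holds three numbers.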

Answer 1 (score: 1)

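I would first use str.strip() to remove the whitespace at the start (and end) of each line, then split it into fixed-width chunks with the approach you already outlined in the question. Note that each number occupies 20 columns rather than 18, and that stripping removes the blank sign column of a positive first number, so the stripped line is padded back to 60 characters before splitting:

n = 20  # width of one field, including the sign column

def parse_line(line):
    line = line.rjust(3 * n)  # restore the sign column lost by strip()
    return [float(line[i:i+n]) for i in range(0, len(line), n)]

def get_matrix(filename):
    with open(filename) as f:
        return [parse_line(line.strip()) for line in f if line.strip()]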

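Alternatively, you could adjust the line-parsing code so that it starts from index 13 instead of index 0. However, this is a less robust solution, so I would still go with the first one.

n = 20  # width of one field, including the sign column

def parse_line(line):
    line = line.rstrip()  # drop the newline and any trailing spaces
    return [float(line[i:i+n]) for i in range(13, len(line), n)]

def get_matrix(filename):
    with open(filename) as f:
        return [parse_line(line) for line in f if line.strip()]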
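
Before relying on fixed offsets, it may help to confirm the column layout on a couple of lines from the question. The sketch below assumes each number occupies 20 columns after a 13-column margin, so a positive number simply starts with a space; the constant and function names are made up for this example.

```python
MARGIN = 13   # assumed width of the leading blank margin
WIDTH = 20    # assumed width of each numeric field, sign column included

neg_first = " " * 13 + "-0.2066687680781E-01 0.4329528510571E+00-0.9011796712875E+00"
pos_first = " " * 14 + "0.5163235664368E+00-0.3289847448468E-01-0.8557614684105E+00"

def fields(line):
    """Cut a line into WIDTH-column fields, starting after the margin."""
    line = line.rstrip()
    return [line[i:i + WIDTH] for i in range(MARGIN, len(line), WIDTH)]

neg_fields = fields(neg_first)
pos_fields = fields(pos_first)

# Stripping first shifts a positive-first line left by one column,
# so chunks taken from index 0 no longer line up with the fields.
shifted = pos_first.strip()
misaligned = [shifted[i:i + WIDTH] for i in range(0, len(shifted), WIDTH)]
```

Here every chunk in `neg_fields` and `pos_fields` converts cleanly with `float()`, while the first chunk of `misaligned` ends with the minus sign of the next number and would not.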

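Answer 2 (score: 1)

Use numpy.genfromtxt to parse the fixed-width file. The delimiter parameter can be set to a sequence of field widths, and autostrip removes whitespace from the values.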

numpy.genfromtxt(fname, delimiter=(33, 20, 20), autostrip=True)
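
As a quick check of the widths, the call can be tried on two lines from the question held in memory. The widths (33, 20, 20) follow from a 13-column margin plus three 20-column fields; `io.StringIO` stands in for the real file here, which is an assumption made only so the example runs on its own.

```python
import io
import numpy as np

text = (
    " " * 13 + "-0.2066687680781E-01 0.4329528510571E+00-0.9011796712875E+00\n"
    + " " * 14 + "0.5163235664368E+00-0.3289847448468E-01-0.8557614684105E+00\n"
)

# First field: 13 margin columns + 20 number columns = 33; the others are 20 wide.
sphere = np.genfromtxt(io.StringIO(text), delimiter=(33, 20, 20), autostrip=True)
```

For the actual data, the `io.StringIO(text)` argument would be replaced by the file name, as in the call above.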

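Answer 3 (score: 0)

If negative numbers are the only problem, you can inject a space in front of every minus sign that is not part of an exponent, on each line of the file: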

import numpy as np
import re

values = []
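with open("sphererad1.out") as handle:  # file name taken from the question
    for line in handle:
        # list(...) so this also works on Python 3, where map() is lazy
        values.append(list(map(float, re.sub(r'(?<![eE])-', ' -', line).split())))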
values = np.asarray(values)

Here I use a negative lookbehind assertion to avoid matching the minus sign in E-.
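
The effect of the substitution can be seen on a single line from the question. This is just an illustration; the variable names are invented for the example.

```python
import re

line = " " * 14 + "0.5163235664368E+00-0.3289847448468E-01-0.8557614684105E+00\n"

# Put a space before every '-' that is not directly preceded by 'e' or 'E',
# so mantissa signs get separated while exponent signs stay attached.
spaced = re.sub(r'(?<![eE])-', ' -', line)
tokens = spaced.split()
values = [float(tok) for tok in tokens]
```

The approach assumes every exponent is written with E or e, as in the sample data.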

Answer 4 (score: 0)

Can't you simply slice each line?

for line in bad_file:  # bad_file is the open file object
    print(float(line[13:33]), float(line[33:53]), float(line[53:73]))

Or get all the data at once:

new_data = [
    [float(line[13:33]), float(line[33:53]), float(line[53:73])]
    for line in bad_file
]
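
For a whole file, the same slices can be wrapped in a loader that tolerates blank lines. The sketch below writes two sample lines to a temporary file so that it runs on its own; `load_fixed_width` and `COLUMNS` are names chosen for this example, and the offsets assume the 13/33/53/73 layout used in the slices above.

```python
import os
import tempfile

COLUMNS = [(13, 33), (33, 53), (53, 73)]  # assumed start/end offsets of each field

def load_fixed_width(path):
    """Read a fixed-width file into a list of [x, y, z] float rows."""
    rows = []
    with open(path) as bad_file:
        for line in bad_file:
            if line.strip():  # skip blank lines, e.g. at the end of the file
                rows.append([float(line[a:b]) for a, b in COLUMNS])
    return rows

# Write two lines from the question (plus a blank line) to a temporary file.
sample = (
    " " * 13 + "-0.2066687680781E-01 0.4329528510571E+00-0.9011796712875E+00\n"
    + " " * 14 + "0.5163235664368E+00-0.3289847448468E-01-0.8557614684105E+00\n"
    + "\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".out", delete=False) as tmp:
    tmp.write(sample)

new_data = load_fixed_width(tmp.name)
os.remove(tmp.name)
```

`float()` ignores the leading space of a positive field, so no extra stripping is needed before conversion.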