I am trying to open a text file in Python as an array or a list of lists. The file looks like the sample below; the full text file is available at ftp://rammftp.cira.colostate.edu/demaria/ebtrk/ebtrk_atlc.txt
AL0188 ALBERTO 080518 1988 32.0 77.5 20 1015 -99 -99 -99 -99 0 0 0 0 0 0 0 0 0 0 0 0 * 218.
AL0188 ALBERTO 080600 1988 32.8 76.2 20 1014 -99 -99 -99 -99 0 0 0 0 0 0 0 0 0 0 0 0 * 213.
AL0188 ALBERTO 080712 1988 41.5 69.0 35 1002 -99 -99 1012 60 100100 50 50 0 0 0 0 0 0 0 0 * 118.
AL0188 ALBERTO 080718 1988 43.0 67.5 35 1002 -99 -99 1008 50 100100 50 50 0 0 0 0 0 0 0 0 * 144.
AL0188 ALBERTO 080800 1988 45.0 65.5 35 1004 -99 -99 1008 50 -99-99-99-99 0 0 0 0 0 0 0 0 * 22.
AL0188 ALBERTO 080806 1988 47.0 63.0 35 1006 -99 -99 1008 50 -99-99-99-99 0 0 0 0 0 0 0 0 * 64.
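I tried NumPy's genfromtxt, but it fails because it cannot tell that 100100 is two values belonging to two separate columns. It treats it as a single entry, so it raises an error saying that the number of columns does not match from line to line.
Is there any way to solve this? Thanks.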
Answer 0 (score: 3)
You can pass the field widths as the delimiter argument. For example:
import numpy as np

with open('ebtrk_atlc.txt', 'r') as f:
    # A list of integers passed as delimiter is read as fixed field widths.
    data = np.genfromtxt(f,
                         dtype=None,
                         delimiter=[7, 10, 7, 4, 5, 6, 4, 5, 4, 4, 5, 4, 4, 3, 3, 3])
print(data)
which gives the following output (first few rows omitted):
('AL0188 ', 'ALBERTO ', 80712, 1988, 41.5, 69.0, 35, 1002, -99, -99, 1012, 60, 100, 100, 50, 50)
('AL0188 ', 'ALBERTO ', 80718, 1988, 43.0, 67.5, 35, 1002, -99, -99, 1008, 50, 100, 100, 50, 50)
('AL0188 ', 'ALBERTO ', 80800, 1988, 45.0, 65.5, 35, 1004, -99, -99, 1008, 50, -99, -99, -99, -99)
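As you can see, the 100100 field has been split in two. Of course, you have to supply the correct field types and widths yourself; this example only shows that it can be done. For example, changing the code to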
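import re

import numpy as np

with open('ebtrk_atlc.txt', 'r') as f:
    # The numbers in dt are reused below as the field widths.
    dt = "a7,a10,a7,i4,f5,f6,i4,i5,i4,i4,i5,i4,i4,i3,i3,i3"
    data = np.genfromtxt(f,
                         dtype=dt,
                         # list(...) is needed on Python 3, where map returns an iterator
                         delimiter=list(map(int, re.split(",?[a-z]", dt[1:]))),
                         autostrip=True)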
changes the result to
('AL0188', 'ALBERTO', '080712', 1988, 41.5, 69.0, 35, 1002, -99, -99, 1012, 60, 100, 100, 50, 50)
('AL0188', 'ALBERTO', '080718', 1988, 43.0, 67.5, 35, 1002, -99, -99, 1008, 50, 100, 100, 50, 50)
('AL0188', 'ALBERTO', '080800', 1988, 45.0, 65.5, 35, 1004, -99, -99, 1008, 50, -99, -99, -99, -99)
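which strips the whitespace around the strings and explicitly sets the type of some fields to float. Further documentation can be found here; see the examples at the bottom.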
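To illustrate what a list of widths does, here is a minimal pure-Python sketch of fixed-width splitting. It is not how genfromtxt is implemented; split_fixed and the sample row are made up for the illustration, and the row is padded to the widths used above rather than copied from the real file.

```python
from itertools import accumulate

# Field widths from the answer above (first 16 fields only).
WIDTHS = [7, 10, 7, 4, 5, 6, 4, 5, 4, 4, 5, 4, 4, 3, 3, 3]


def split_fixed(line, widths):
    """Cut line into consecutive slices of the given widths and strip each one."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [line[s:e].strip() for s, e in zip(starts, ends)]


# Hypothetical sample row: the three text fields are left-justified and the
# numeric fields right-justified to their widths (an assumption about the layout).
values = ["AL0188", "ALBERTO", "080712", "1988", "41.5", "69.0", "35", "1002",
          "-99", "-99", "1012", "60", "100", "100", "50", "50"]
sample = "".join(
    v.ljust(w) if i < 3 else v.rjust(w)
    for i, (v, w) in enumerate(zip(values, WIDTHS))
)

# Splitting on whitespace fuses the two 100 values into one token ...
print(sample.split())
# ... while slicing by width recovers all 16 fields.
fields = split_fixed(sample, WIDTHS)
print(fields)
```

Slicing by position is what makes the approach insensitive to missing spaces between neighbouring fields.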
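Answer 1 (score: 0)
Old-school parsing is possible because the file is fairly well structured. It is a bit long, but it seems to do the trick.
Before: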
$ awk '{print NF}' ebtrk_atlc.txt | sort | uniq -c
79 17
16 18
92 19
494 20
308 21
405 22
1769 23
897 24
1329 25
5444 26
27 27
After:
$ awk '{print NF}' log | sort | uniq -c
8778 27
2082 28
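If awk is not at hand, the same column-count check can be sketched in Python. field_count_histogram is a hypothetical helper name, and the two rows are taken from the question rather than read from the file.

```python
from collections import Counter


def field_count_histogram(lines):
    """Map 'number of whitespace-separated fields' to 'number of lines'."""
    return Counter(len(line.split()) for line in lines)


# Two rows from the question; the second contains the fused token 100100,
# so it splits into one field fewer than the first.
sample_lines = [
    "AL0188 ALBERTO 080518 1988 32.0 77.5 20 1015 -99 -99 -99 -99 "
    "0 0 0 0 0 0 0 0 0 0 0 0 * 218.",
    "AL0188 ALBERTO 080712 1988 41.5 69.0 35 1002 -99 -99 1012 60 "
    "100100 50 50 0 0 0 0 0 0 0 0 * 118.",
]

histogram = field_count_histogram(sample_lines)
# Same layout as `uniq -c`: line count first, then the field count.
for n_fields, n_lines in sorted(histogram.items()):
    print(n_lines, n_fields)
```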
Code:
#!/usr/bin/env python
def chunks(l, n):
    """Split sequence l into consecutive pieces of length n."""
    return [l[i:i+n] for i in range(0, len(l), n)]

with open("ebtrk_atlc.txt") as fd:
    for line in fd:
        cols = line.strip().split()
        # 26 columns seems to be the target
        # after column 13, split on -
        if len(cols) < 26:
            tmp = []
            for i in cols[-13:]:
                if '-' in i:
                    for n in i.split('-'):
                        if n:
                            tmp.append('-' + n)
                elif len(i) == 6 or len(i) == 9 or len(i) == 12:
                    for n in chunks(i, 3):
                        tmp.append(n)
                elif len(i) == 8:
                    # 50100100 split in 2-3-3-fashion
                    tmp.append(i[0:2])
                    tmp.append(i[2:5])
                    tmp.append(i[5:8])
                elif len(i) == 5:
                    # 50100 split in 2-3-fashion
                    tmp.append(i[0:2])
                    tmp.append(i[2:5])
                elif len(i) == 7:
                    # 0285195 split in 1-3-3-fashion
                    tmp.append(i[0])
                    tmp.append(i[1:4])
                    tmp.append(i[4:7])
                elif len(i) == 11:
                    # 30120160200 split in 2-3-3-3-fashion
                    tmp.append(i[0:2])
                    tmp.append(i[2:5])
                    tmp.append(i[5:8])
                    tmp.append(i[8:11])
                elif len(i) == 10:
                    # 0180180210 split in 1-3-3-3-fashion
                    tmp.append(i[0])
                    tmp.append(i[1:4])
                    tmp.append(i[4:7])
                    tmp.append(i[7:10])
                else:
                    tmp.append(i)
            # one final loop to fix strings beginning with a 0
            tmp2 = []
            for i in tmp:
                if i.startswith('0') and len(i) > 2:
                    tmp2.append(i[0])
                    tmp2.append(i[1:])
                else:
                    tmp2.append(i)
            # rebuild list
            data = cols[0:-13] + tmp2
            print(len(data), data)
        else:
            print(len(cols), cols)
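The per-length branches above all follow one pattern: the fused token is a run of three-character fields in which only the first may be shorter. Assuming that holds for the whole file (not verified here), the branches could be collapsed into a helper like the hypothetical split_fused below. It is only meant for the fused radius columns, so it should not be fed four-digit fields such as the pressure.

```python
def split_fused(token, width=3):
    """Split a token made of fused numeric fields.

    Assumption: every fused field is `width` characters wide, except that
    the first one may be shorter. Tokens that are not purely digits and
    hyphens (for example "118." or "*") are returned unchanged.
    """
    if not token.replace("-", "").isdigit():
        return [token]
    if "-" in token:
        # "-99-99" -> ["-99", "-99"]; a leading unsigned part, as in "50-99",
        # is kept without a sign.
        head, *rest = token.split("-")
        return ([head] if head else []) + ["-" + part for part in rest]
    if len(token) <= width:
        return [token]
    # Chunk from the right so that only the first piece can be short.
    first = len(token) % width or width
    pieces = [token[:first]]
    pieces += [token[i:i + width] for i in range(first, len(token), width)]
    return pieces


for example in ["100100", "50100100", "0285195", "-99-99-99-99", "118."]:
    print(example, split_fused(example))
```

Chunking from the right replaces the separate cases for lengths 5, 7, 8, 10 and 11, at the cost of relying on the three-character assumption.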