I have a text file generated by a FORTRAN program, in a rather odd (and admittedly annoying) format:
3.4502 1.5959 0.2160 0.9423 0.1098 1.2463 -2.8673 0.8803
3.5724 1.8022 0.3423 1.0801 2.4177 -0.2012 -0.1142 -0.2061
2.6028 2.6395 0.2959 0.8280 2.0526 -0.0721 -1.1345 0.0110
2.5628 0.0000 0.0539 0.0000 -0.4520 1.3030 -3.0792 1.0428
1.1823 1.4084 0.2315 1.1359 1.5945 3.2098 1.6739 0.0713
0.0296 1.3689 0.0000 1.0425 -0.4525 1.3043 -2.9785 1.0428
2.4825 1.6460 0.2573 2.4801 3.4533 1.5960 0.3609 0.9574
2.2358 0.8858 0.1344 0.5376 3.1102 -0.8025 0.1282 -0.8398
0.0000 1.4078 1.5464 1.0526 3.9754 3.7823 0.3376 0.1303
3.3068 2.5148 0.2390 -0.3816
-0.4672 1.3604 2.0157 1.0405
4.4009 2.9969 0.8777 3.6270
3.0271 4.1610 0.2094 3.0105
-0.4889 1.3888 3.1442 1.0423
6.0767 1.7731 0.6439 2.3744
5.9313 1.3423 0.2204 1.0397
4.4335 2.9075 -0.0328 -0.4526
4.8670 2.6906 0.1088 0.0275
2.5303 3.3157 -0.2649 0.9895
4.3957 3.4142 0.3900 0.4282
3.3185 1.4058 0.2024 3.3997
0.9097 1.3423 0.2388 1.1809
1.3302 1.6167 0.2009 1.0491
2.4382 -0.1739 0.4722 3.5331
1.8617 1.4082 0.2140 0.6741
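I want to read the first four columns and the last four columns separately and store them in NumPy arrays. With numpy.genfromtxt, I can easily get the data from the first four columns: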
object_scores = numpy.genfromtxt("results.out", usecols=(0,1,2,3), max_rows=9)
But when I try the same thing for the other four columns
descriptor_scores = numpy.genfromtxt("results.out", usecols=(4,5,6,7), max_rows=25)
I get a long list of error messages, which seem to be related to the missing cells in the first four columns.
ValueError: Some errors were detected !
Line #10 (got 4 columns instead of 1)
Line #11 (got 4 columns instead of 1)
Line #12 (got 4 columns instead of 1)
Line #13 (got 4 columns instead of 1)
Line #14 (got 4 columns instead of 1)
Line #15 (got 4 columns instead of 1)
Line #16 (got 4 columns instead of 1)
Line #17 (got 4 columns instead of 1)
Line #18 (got 4 columns instead of 1)
Line #19 (got 4 columns instead of 1)
Line #20 (got 4 columns instead of 1)
Line #21 (got 4 columns instead of 1)
Line #22 (got 4 columns instead of 1)
Line #23 (got 4 columns instead of 1)
Line #24 (got 4 columns instead of 1)
Line #25 (got 4 columns instead of 1)
Any hints or suggestions on how to solve this?
Answer 0 (score: 1)
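Unfortunately, the columns do not seem to have the same width (10 characters for the first four fields, then 11). In that case, the delimiter= option of numpy.genfromtxt can help: it accepts a list of integer field widths instead of a separator string.
An alternative is to read only the four fields that start at column 37.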
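A minimal sketch of the delimiter= approach. It assumes every field is exactly 10 characters wide, which is only a guess: measure the widths in the real file, since the whitespace in the question may not have survived copying. The helper `make_line` and the in-memory sample are hypothetical and exist only to make the example self-contained.

```python
import io
import numpy as np

WIDTH = 10  # assumed field width; check this against the real file


def make_line(left, right):
    """Format one fixed-width line; pass left=None for a blank first block."""
    if left:
        left_part = "".join(f"{v:{WIDTH}.4f}" for v in left)
    else:
        left_part = " " * (4 * WIDTH)
    return left_part + "".join(f"{v:{WIDTH}.4f}" for v in right)


# Three illustrative rows taken from the question; the last has no first block.
sample = "\n".join([
    make_line([3.4502, 1.5959, 0.2160, 0.9423], [0.1098, 1.2463, -2.8673, 0.8803]),
    make_line([3.5724, 1.8022, 0.3423, 1.0801], [2.4177, -0.2012, -0.1142, -0.2061]),
    make_line(None, [3.3068, 2.5148, 0.2390, -0.3816]),
])

# A list of integers tells genfromtxt to cut each line at fixed widths
# instead of splitting on whitespace, so blank fields keep their position.
descriptor_scores = np.genfromtxt(
    io.StringIO(sample), delimiter=[WIDTH] * 8, usecols=(4, 5, 6, 7)
)
print(descriptor_scores)
```

Because the line is cut by position, the rows with an empty first block still yield eight fields, and usecols can then pick out the last four.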
Answer 1 (score: 0)
If the file format is always the same, you can do this:
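import numpy as np

def squash(obj):
    # Convert each column to floats, skipping blank fields
    return [[float(element) for element in column if element.strip() != ''] for column in obj]

with open('results.out') as f:
    data = f.read()

# Skip empty lines: an empty trailing line would make zip() return nothing
lines = [line for line in data.splitlines() if line.strip()]
number_width = 6    # characters occupied by each number
number_spacing = 4  # blank characters between numbers
result = squash(zip(*[[line[i:i + number_width] for i in range(0, len(line), number_width + number_spacing)]
                      for line in lines]))
first_four_cols = np.array(result[0:4]).T
last_four_cols = np.array(result[4:]).T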
Answer 2 (score: 0)
Copying and pasting the data into a file:
In [85]: data = np.genfromtxt('stack54544789.py', delimiter=[10]*8)
In [86]: data
Out[86]:
array([[3.4502, 1.5959, 0.216 , 0.9423, 0.1098, nan, 2.8673, 0.8803],
[3.5724, 1.8022, 0.3423, 1.0801, nan, nan, nan, 0.2061],
[2.6028, 2.6395, 0.2959, 0.828 , nan, nan, 1.1345, 0.011 ],
[2.5628, 0. , 0.0539, nan, 0.452 , nan, 3.0792, 1.0428],
[1.1823, 1.4084, 0.2315, 1.1359, 1.5945, 3.2098, 1.6739, 0.0713],
...
[ nan, nan, nan, nan, 1.3302, 1.6167, 0.2009, 1.0491],
[ nan, nan, nan, nan, nan, 0.1739, 0.4722, 3.5331],
[ nan, nan, nan, nan, 1.8617, 1.4082, 0.214 , 0.6741],
[ nan, nan, nan, nan, nan, nan, nan, nan]])
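This looks almost right; I think the extra nan values come from misplaced minus signs. Making the first field one character narrower fixes the alignment: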
In [87]: data = np.genfromtxt('stack54544789.py', delimiter=[9]+[10]*7)
In [88]: data
Out[88]:
array([[ 3.4502, 1.5959, 0.216 , 0.9423, 0.1098, 1.2463, -2.8673,
0.8803],
[ 3.5724, 1.8022, 0.3423, 1.0801, 2.4177, -0.2012, -0.1142,
-0.2061],
[ 2.6028, 2.6395, 0.2959, 0.828 , 2.0526, -0.0721, -1.1345,
0.011 ],
[ 2.5628, 0. , 0.0539, 0. , -0.452 , 1.303 , -3.0792,
1.0428],
...
[ nan, nan, nan, nan, 2.4382, -0.1739, 0.4722,
3.5331],
[ nan, nan, nan, nan, 1.8617, 1.4082, 0.214 ,
0.6741],
[ nan, nan, nan, nan, nan, nan, nan,
nan]])
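Once the full array is loaded, the two blocks the question asks for can be sliced out of it. The following is a sketch only: the small `data` array stands in for the real result, and `drop_empty_rows` is a hypothetical helper, not part of NumPy.

```python
import numpy as np

# Stand-in for the `data` array above: NaN marks a blank field.
data = np.array([
    [3.4502, 1.5959, 0.2160, 0.9423, 0.1098, 1.2463, -2.8673, 0.8803],
    [3.5724, 1.8022, 0.3423, 1.0801, 2.4177, -0.2012, -0.1142, -0.2061],
    [np.nan, np.nan, np.nan, np.nan, 2.4382, -0.1739, 0.4722, 3.5331],
    [np.nan] * 8,  # trailing empty line
])


def drop_empty_rows(block):
    """Keep only the rows that contain at least one non-NaN value."""
    return block[~np.isnan(block).all(axis=1)]


object_scores = drop_empty_rows(data[:, :4])
descriptor_scores = drop_empty_rows(data[:, 4:])
print(object_scores.shape, descriptor_scores.shape)
```

Rows that are only partly NaN are kept, so a genuinely malformed field would still show up in the result instead of being silently dropped.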
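Answer 3 (score: 0)
Although it certainly differs from delimited formats such as .csv (and so may be a bit annoying), Fortran and similar languages often use fixed-width formats like this one. They perform well on large files and often directly match how the data is laid out in memory, which makes code in those languages easier to write.
I am not sure your example contains the complete data (StackOverflow may have stripped some whitespace for you). But I would expect that, when you read the file directly, each column is exactly 10 characters wide, in which case you can read it like this: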
def convert(s):
    # Return the field as a float, or None if it is blank or malformed
    try:
        return float(s)
    except ValueError:
        return None

data = []
size = 10  # assumed field width in characters
with open('input.data', 'r') as f:
    for line in f:
        line = line.rstrip('\n')  # drop the EOL before slicing
        data.append([convert(line[i:i + size]) for i in range(0, len(line), size)])
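Others have noted that the column widths seem to vary, but I think that is just an artifact of copying the data into the question; the fields most likely all have the same width in the source data file.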
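To get from the nested list to the two NumPy arrays the question asks for, something like the following sketch could work. The `data` list here is a stand-in for the one built by the loop above, and `to_block` is a hypothetical helper introduced for illustration.

```python
import numpy as np

# Stand-in for the `data` list built above: None marks a blank field.
data = [
    [3.4502, 1.5959, 0.2160, 0.9423, 0.1098, 1.2463, -2.8673, 0.8803],
    [None, None, None, None, 3.3068, 2.5148, 0.2390, -0.3816],
]


def to_block(rows, start, stop):
    """Collect columns start..stop-1, skipping rows where that block is incomplete."""
    width = stop - start
    blocks = (row[start:stop] for row in rows)
    kept = [b for b in blocks if len(b) == width and None not in b]
    return np.array(kept, dtype=float)


object_scores = to_block(data, 0, 4)
descriptor_scores = to_block(data, 4, 8)
print(object_scores.shape, descriptor_scores.shape)
```

Filtering per block, not per line, lets the two arrays have different lengths, which matches the file: the first block has fewer rows than the second.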