I am trying to parse a text file with the following structure:
latitude 5.0000
number_of_data_values 9
0.1 0.2 0.3 0.4
1.1 1.2 1.3 1.4
8.1
latitude 4.3000
number_of_data_values 9
0.1 0.2 0.3 0.4
1.1 1.2 1.3 1.4
8.1
latitude 4.0000
number_of_data_values 9
0.1 0.2 0.3 0.4
1.1 1.2 1.3 1.4
8.1
...
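Each distinct latitude value corresponds to a different row of the array. number_of_data_values is the number of columns (it is the same throughout the file).
For this example, I want to read the file and output a 3-by-9 two-dimensional array like this: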
array = [[0.1,0.2,0.3,0.4,1.1,1.2,1.3,1.4,8.1],
[0.1,0.2,0.3,0.4,1.1,1.2,1.3,1.4,8.1],
[0.1,0.2,0.3,0.4,1.1,1.2,1.3,1.4,8.1]]
I tried doing this with nested loops, but I am looking for a more efficient approach, since I may have to process large input files.
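Answer 0: (score: 0)
This seems pretty straightforward. The part that parses the numbers is just line.split(). The rest of the parsing can be made stricter or more lenient, depending on how stable the input data format is.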
results = []
latitude = None
numbers_total = None
value_list = []
# `text` is assumed to hold the whole file contents as one string.
for line in text.splitlines():
    if line.startswith('latitude '):
        # A new block starts: store the previous one, if there is one.
        if latitude is not None:
            assert len(value_list) == numbers_total
            results.append((latitude, value_list))
            value_list = []
        latitude = line.split()[-1]
    elif line.startswith('number_of_data_values '):
        numbers_total = int(line.split()[-1])
    else:
        value_list.extend(line.split())
# Make sure the last block gets added to the results.
if latitude is not None:
    assert len(value_list) == numbers_total
    results.append((latitude, value_list))
    value_list = []
for latitude, value_list in results:
    print('latitude %r: %r' % (latitude, value_list))
Output:
latitude '5.0000': ['0.1', '0.2', '0.3', '0.4', '1.1', '1.2', '1.3', '1.4', '8.1']
latitude '4.3000': ['0.1', '0.2', '0.3', '0.4', '1.1', '1.2', '1.3', '1.4', '8.1']
latitude '4.0000': ['0.1', '0.2', '0.3', '0.4', '1.1', '1.2', '1.3', '1.4', '8.1']
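The values above are still strings, while the question asks for a two-dimensional array of numbers. A minimal sketch of that last conversion step, assuming `results` has the shape printed above; `to_rows` is a hypothetical helper name, not part of the original answer:

```python
def to_rows(results):
    """Turn [(latitude, [str, ...]), ...] into (latitudes, rows) of floats."""
    latitudes = [float(lat) for lat, _ in results]
    rows = [[float(v) for v in values] for _, values in results]
    return latitudes, rows


# Sample data shaped like the output printed above (two blocks only).
results = [
    ('5.0000', ['0.1', '0.2', '0.3', '0.4', '1.1', '1.2', '1.3', '1.4', '8.1']),
    ('4.3000', ['0.1', '0.2', '0.3', '0.4', '1.1', '1.2', '1.3', '1.4', '8.1']),
]

latitudes, array = to_rows(results)
print(latitudes)
print(array)
```

Keeping the latitudes in a separate list means that row i of `array` belongs to `latitudes[i]`.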
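Answer 1: (score: 0)
A line-by-line implementation is fairly easy to understand. Assuming your latitude always starts on a new line (which is not what your example shows, but that may be a typo), you can do this: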
latitudes = []
counts = []
blocks = []
current_block = []
# `test` is assumed to be an iterable of lines, e.g. an open file.
for line in test:
    if line.startswith("latitude"):
        # New block: add the previous one to `blocks` and reset
        blocks.append(current_block)
        current_block = []
        latitudes.append(float(line.split()[-1]))
    elif line.startswith("number_of_data"):
        # Just append the current count to the list
        counts.append(int(line.split()[-1]))
    else:
        # Update the current block
        current_block += [float(f) for f in line.strip().split()]
# Make sure to add the last block...
blocks.append(current_block)
# And to remove the first (empty) one
blocks.pop(0)
You can now check that all blocks have the proper size:
all(len(b)==c for (c,b) in zip(counts,blocks))
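The loop and the size check above can be folded into a single function that fails loudly on a malformed block. This is only a sketch under the same assumption (every latitude starts on its own line); `parse_blocks` is a hypothetical name, and `io.StringIO` stands in for an open file:

```python
import io


def parse_blocks(lines):
    """Parse an iterable of lines into (latitudes, blocks) of floats."""
    latitudes, counts, blocks = [], [], []
    for line in lines:
        if line.startswith("latitude"):
            # New block: remember its latitude and start an empty row
            latitudes.append(float(line.split()[-1]))
            blocks.append([])
        elif line.startswith("number_of_data"):
            counts.append(int(line.split()[-1]))
        elif blocks:
            # Data line: extend the row of the current block
            blocks[-1].extend(float(f) for f in line.split())
    if len(counts) != len(blocks):
        raise ValueError("each latitude needs a number_of_data_values line")
    for lat, count, block in zip(latitudes, counts, blocks):
        if len(block) != count:
            raise ValueError("latitude %s: expected %d values, got %d"
                             % (lat, count, len(block)))
    return latitudes, blocks


sample = io.StringIO(
    "latitude 5.0000\n"
    "number_of_data_values 9\n"
    "0.1 0.2 0.3 0.4\n"
    "1.1 1.2 1.3 1.4\n"
    "8.1\n"
    "latitude 4.3000\n"
    "number_of_data_values 9\n"
    "0.1 0.2 0.3 0.4\n"
    "1.1 1.2 1.3 1.4\n"
    "8.1\n"
)
latitudes, blocks = parse_blocks(sample)
print(latitudes)
print(blocks)
```

Starting a fresh row when a latitude line is seen avoids the empty first block that otherwise has to be popped afterwards.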
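Alternative solution
If you are worried about the loop, you may want to consider working on a memory-mapped version of the file. The idea is to find the position of a line starting with latitude. Once you have found one, find the next one, and you have a block of text: drop the first two lines (the one starting with latitude and the one starting with number_of_data), join the remaining lines, and process them.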
import mmap

with open("crap.txt", "rb") as f:
    # Create the mapper (read-only)
    mapper = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Initialize your output variables
    latitudes = []
    blocks = []
    # Find the beginning of the first block (an mmap is searched with bytes)
    position = mapper.find(b"latitude")
    # `position` will be -1 if we can't find it
    while position >= 0:
        # Move to the beginning of the block
        mapper.seek(position)
        # Read the first line
        lat_line = mapper.readline().strip()
        latitudes.append(float(lat_line.split()[-1]))
        # Read the second one
        zap = mapper.readline()
        # Where are we ?
        start = mapper.tell()
        # Where's the next block ?
        position = mapper.find(b"latitude", start)
        # The last block has no successor, so it runs to the end of the file
        end = position if position >= 0 else len(mapper)
        # Read the lines of the block as one chunk
        current_block = mapper.read(end - start)
        # Transform the chunk into a list of floats and update the block
        blocks.append([float(i) for i in current_block.split()])
    mapper.close()
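For a version that can be run end to end, here is a hedged variation on the same idea. It assumes Python 3, writes a two-block sample to a temporary file, locates the header offsets with `re.finditer` over the mapped file, and slices each block out between consecutive offsets. `parse_mapped` and `SAMPLE` are illustrative names, not part of the original answer:

```python
import mmap
import os
import re
import tempfile

SAMPLE = (
    b"latitude 5.0000\n"
    b"number_of_data_values 9\n"
    b"0.1 0.2 0.3 0.4\n"
    b"1.1 1.2 1.3 1.4\n"
    b"8.1\n"
    b"latitude 4.3000\n"
    b"number_of_data_values 9\n"
    b"0.1 0.2 0.3 0.4\n"
    b"1.1 1.2 1.3 1.4\n"
    b"8.1\n"
)


def parse_mapped(path):
    """Return (latitudes, blocks) parsed from a memory-mapped file."""
    latitudes, blocks = [], []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapper:
            # Offsets of every line that starts with "latitude"
            starts = [m.start() for m in
                      re.finditer(rb"^latitude", mapper, re.MULTILINE)]
            # Each block ends at the next header, or at the end of the file
            ends = starts[1:] + [len(mapper)]
            for start, end in zip(starts, ends):
                lines = mapper[start:end].splitlines()
                latitudes.append(float(lines[0].split()[-1]))
                # lines[1] is the number_of_data_values line; skip it
                blocks.append([float(tok)
                               for line in lines[2:]
                               for tok in line.split()])
    return latitudes, blocks


# Write the sample to a temporary file so the sketch runs as-is.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(SAMPLE)
latitudes, blocks = parse_mapped(tmp.name)
os.remove(tmp.name)
print(latitudes)
print(blocks)
```

Slicing the map copies only one block at a time into memory, so the whole file never has to be held as a single string. Whether this is actually faster than the plain loop depends on the file and is worth measuring.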