I have spent most of the morning failing to solve this seemingly simple problem. Using Python, I want to parse data files that look like this:
# This is an example comment line, it starts with a '#' character.
# There can be a variable number of comments between each data set.
# Comments "go with" the data set that comes after them.
# The first data set starts on the next line:
0.0 1.0
1.0 2.0
2.0 3.0
3.0 4.0
# Data sets are followed by variable amounts of white space.
# The second data set starts after this comment
5.0 6.0
6.0 7.0
# One more data set.
7.0 8.0
8.0 9.0
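The Python code I want would parse the example above into three "blocks" and store them as elements of a list. Each block can itself be stored as a list of lines, with or without the comment lines, whichever is easier. A manual way to do it is: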
#! /usr/bin/env python
# Read in data, separate into rows_alldata
f=open("example")
rows = f.read().split('\n')
f.close()
# Do you haz teh codez?
datasets=[]
datasets.append(rows[0:8])
datasets.append(rows[9:13])
datasets.append(rows[15:18])
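I am looking for a more general solution that supports a variable number and length of data sets. I have tried several disastrous, un-Pythonic-looking loops. I think it is best not to clutter my question with them; this is work, not "homework".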
Answer 0 (score: 5)
Use groupby.
from itertools import groupby

def contains_data(ln):
    # just an example; there are smarter ways to do this
    return ln[0] not in "#\n"

with open("example") as f:
    datasets = [[ln.split() for ln in group]
                for has_data, group in groupby(f, contains_data)
                if has_data]
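To try the groupby idea without a file on disk, here is a minimal sketch that applies it to an in-memory string. The names `SAMPLE`, `is_data`, and `parse_blocks` are illustrative and not part of the answer above. The predicate uses `strip()` so that whitespace-only lines count as separators, and the values are converted to `float`; both are assumptions about what the asker wants.

```python
import io
from itertools import groupby

# Hypothetical sample input, shortened from the question's example.
SAMPLE = """\
# first data set
0.0 1.0
1.0 2.0

# second data set
# (two comment lines)
5.0 6.0
"""

def is_data(line):
    """True for lines that hold data: not blank and not a '#' comment."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith("#")

def parse_blocks(lines):
    """Group consecutive data lines into blocks of float rows."""
    return [
        [[float(x) for x in line.split()] for line in group]
        for has_data, group in groupby(lines, key=is_data)
        if has_data
    ]

blocks = parse_blocks(io.StringIO(SAMPLE))
print(blocks)  # [[[0.0, 1.0], [1.0, 2.0]], [[5.0, 6.0]]]
```

Because groupby starts a new group whenever the predicate changes, a blank line between two data rows also ends a block here, not only a comment line.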
Answer 1 (score: 3)
datasets = [[]]
with open('/tmp/spam.txt') as f:
    for line in f:
        if line.startswith('#'):
            if datasets[-1] != []:
                # we are in a new block
                datasets.append([])
        else:
            stripped_line = line.strip()
            if stripped_line:
                datasets[-1].append(stripped_line)
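The question mentions that comments "go with" the data set that follows them. If the comments should be kept, the same loop structure can be extended along these lines. This is only a sketch: the function name `parse_with_comments` and the dict layout with `comments` and `rows` keys are hypothetical choices, and trailing comments that are not followed by any data are silently dropped.

```python
def parse_with_comments(lines):
    """Return a list of {'comments': [...], 'rows': [...]} dicts."""
    blocks = []
    comments = []  # comment lines seen since the last finished block
    rows = []      # data lines of the block being built
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no information here
        if stripped.startswith("#"):
            if rows:
                # a comment after data closes the current block
                blocks.append({"comments": comments, "rows": rows})
                comments, rows = [], []
            comments.append(stripped)
        else:
            rows.append(stripped)
    if rows:
        # flush the last block, which has no comment after it
        blocks.append({"comments": comments, "rows": rows})
    return blocks

# In-memory stand-in for the file object (an assumption for the demo).
example = [
    "# header\n",
    "0.0 1.0\n",
    "\n",
    "# second\n",
    "# more\n",
    "5.0 6.0\n",
    "6.0 7.0\n",
]
result = parse_with_comments(example)
print(result)
```

Any iterable of lines works as input, so an open file can be passed in directly.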
Answer 2 (score: 1)
import pprint

with open("test.txt") as fh:
    codes = []
    codeblock = []
    for line in fh:
        stripped_line = line.strip()
        if not stripped_line:
            continue
        if stripped_line.startswith("#"):
            if codeblock:
                codes.append(codeblock)
                codeblock = []
        else:
            codeblock.append(stripped_line.split(" "))
    if codeblock:
        codes.append(codeblock)

pprint.pprint(codes)
Output:
[[['0.0', '1.0'], ['1.0', '2.0'], ['2.0', '3.0'], ['3.0', '4.0']],
[['5.0', '6.0'], ['6.0', '7.0']],
[['7.0', '8.0'], ['8.0', '9.0']]]
Answer 3 (score: -1)
datasets = []
with open('example') as f:
    for line in f:
        if line and not line.startswith('#'):
            datasets.append(line.split())
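Note that this loop collects every row into one flat list and does not separate the blocks. The small sketch below runs the same test on in-memory lines in place of the file (an assumption made so that it runs standalone) to show the resulting shape.

```python
# In-memory stand-in for the file; a blank line is included on purpose.
lines = ["# first\n", "0.0 1.0\n", "1.0 2.0\n", "\n", "# second\n", "5.0 6.0\n"]

datasets = []
for line in lines:
    # same test as in the loop above; "\n" is truthy, so blank lines pass it
    if line and not line.startswith('#'):
        datasets.append(line.split())

print(datasets)  # [['0.0', '1.0'], ['1.0', '2.0'], [], ['5.0', '6.0']]
```

The blank line shows up as an empty list because `"\n".split()` returns `[]`, and nothing marks where one data set ends and the next begins.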