I want to read an ARFF file and split its attributes and data into separate lists. The file is here. I tried the following code:
from itertools import dropwhile
attributes = []
with open('balloons.arff', 'r') as f:
    for l in f.readlines(): ##1
        items = l.split(' ') ##2
        if items[0] == '@attribute': ##3
            attributes.append(items[1]) ##4
    data = dropwhile(lambda _line: "@data" not in _line, f) ##5
    next(data,"") ##6
    for line in data: ##7
        print(line.strip()) ##8
print(attributes) ##9
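When I run this code I only get the attribute list, but when I comment out lines ##1 through ##4 (the first for loop), the program correctly prints the data section. My files are very large, so an efficient solution would be much appreciated.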
Answer 0 (score: 1)
There is no need to reinvent the wheel. Someone has already written an ARFF parser for Python, liac-arff. Install it with pip:
pip install liac-arff
Then import and use the module:
import arff
# Open in text mode so the parser receives str lines under Python 3
with open('balloons-adult-stretch.arff', 'r') as handle:
    data = arff.load(handle)
print(data['attributes'])
print(data['data'])
Output (as printed by Python 2, hence the u'' prefixes):
[(u'V1', [u'PURPLE', u'YELLOW']), (u'V2', [u'LARGE', u'SMALL']), (u'V3', [u'DIP', u'STRETCH']), (u'V4', [u'ADULT', u'CHILD']), (u'Class', [u'1', u'2'])]
[[u'YELLOW', u'SMALL', u'STRETCH', u'ADULT', u'2'], [u'YELLOW', u'SMALL', u'STRETCH', u'CHILD', u'2'], [u'YELLOW', u'SMALL', u'DIP', u'ADULT', u'2'], [u'YELLOW', u'SMALL', u'DIP', u'CHILD', u'1'], [u'YELLOW', u'SMALL', u'DIP', u'CHILD', u'1'], [u'YELLOW', u'LARGE', u'STRETCH', u'ADULT', u'2'], [u'YELLOW', u'LARGE', u'STRETCH', u'CHILD', u'2'], [u'YELLOW', u'LARGE', u'DIP', u'ADULT', u'2'], [u'YELLOW', u'LARGE', u'DIP', u'CHILD', u'1'], [u'YELLOW', u'LARGE', u'DIP', u'CHILD', u'1'], [u'PURPLE', u'SMALL', u'STRETCH', u'ADULT', u'2'], [u'PURPLE', u'SMALL', u'STRETCH', u'CHILD', u'2'], [u'PURPLE', u'SMALL', u'DIP', u'ADULT', u'2'], [u'PURPLE', u'SMALL', u'DIP', u'CHILD', u'1'], [u'PURPLE', u'SMALL', u'DIP', u'CHILD', u'1'], [u'PURPLE', u'LARGE', u'STRETCH', u'ADULT', u'2'], [u'PURPLE', u'LARGE', u'STRETCH', u'CHILD', u'2'], [u'PURPLE', u'LARGE', u'DIP', u'ADULT', u'2'], [u'PURPLE', u'LARGE', u'DIP', u'CHILD', u'1'], [u'PURPLE', u'LARGE', u'DIP', u'CHILD', u'1']]
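Since the question asks for attributes and data in separate lists, a result shaped like the one above can be split up directly. The following is only a minimal sketch: it uses a hand-written dictionary that mirrors the printed output (shortened to two rows), and the names `dataset`, `names`, `domains`, and `rows` are illustrative, not part of liac-arff.

```python
# A small stand-in shaped like the result printed above
# (two rows only, to keep the example short).
dataset = {
    'attributes': [
        ('V1', ['PURPLE', 'YELLOW']),
        ('V2', ['LARGE', 'SMALL']),
        ('V3', ['DIP', 'STRETCH']),
        ('V4', ['ADULT', 'CHILD']),
        ('Class', ['1', '2']),
    ],
    'data': [
        ['YELLOW', 'SMALL', 'STRETCH', 'ADULT', '2'],
        ['YELLOW', 'SMALL', 'STRETCH', 'CHILD', '2'],
    ],
}

# Attribute names and their allowed values, in declaration order.
names = [name for name, _ in dataset['attributes']]
domains = [values for _, values in dataset['attributes']]

# Data rows are already a list of lists, one inner list per instance.
rows = dataset['data']

print(names)
print(rows[0])
```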
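If you want to write this yourself, the problem with your code is that the first loop reads every line from the file, so nothing is left for the second step. You must either rewind the file handle to the beginning with f.seek(0) after the loop finishes, or parse the file in a single pass by implementing a simple state machine: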
attributes = {}
data = []
reading_data = False
with open('balloons-adult-stretch.arff', 'r') as handle:
    for line in handle:
        line = line.strip()
        # Ignore comments (ARFF comments start with '%') and blank lines
        if line.startswith('%') or not line:
            continue
        # If we have already reached the @data section, we just read indefinitely
        # If @data doesn't come last, this will not work
        if reading_data:
            data.append(line)
            continue
        # Otherwise, try parsing the header line
        if line.startswith('@attribute'):
            key, value = line.split(' ', 2)[1:]
            attributes[key] = value
        elif line.startswith('@data'):
            reading_data = True
        else:
            #raise ValueError('Cannot parse line {!r}'.format(line))
            pass
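Why a rewind is needed at all can be seen in a small experiment. The sketch below uses `io.StringIO` as a stand-in for a real file (an assumption made only to keep the example self-contained); a handle returned by `open()` in text mode should behave the same way.

```python
import io

# io.StringIO stands in for a real file opened in text mode.
handle = io.StringIO("@attribute V1 {A,B}\n@data\nA\nB\n")

first_pass = handle.readlines()   # consumes every line
second_pass = list(handle)        # nothing left: the position is at EOF

handle.seek(0)                    # rewind to the start
third_pass = list(handle)         # all lines are available again

print(len(first_pass), len(second_pass), len(third_pass))  # prints: 4 0 4
```

The second read returns nothing because the position is already at the end of the stream, which is the situation the question's `dropwhile` call runs into.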
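Answer 1 (score: 1)
The problem is that by the time the for loop finishes you have already reached EOF (end of file). This means that when the lambda function starts running, there is nothing left in the file to read. You can either find a way to read the data inside the for loop, or, if you don't mind being (somewhat) inefficient, you can do this: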
from itertools import dropwhile
attributes = []
with open('stuff.txt', 'r') as f:
    for l in f.readlines(): ##1
        items = l.split(' ') ##2
        if items[0] == '@attribute': ##3
            attributes.append(items[1])
    f.seek(0) ##4
    data = dropwhile(lambda _line: "@data" not in _line, f) ##5
    next(data,"") ##6
    for line in data: ##7
        print(line.strip()) ##8
print(attributes)
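The other option mentioned above, reading the data in the same pass, might look roughly like the sketch below. `read_arff` is a hypothetical helper written for this illustration, not an existing library function. It assumes one @attribute declaration per line, a single @data marker, and comma-separated rows without quoted commas. `io.StringIO` again stands in for a real file.

```python
import io

def read_arff(handle):
    """Split an ARFF-like stream into attribute names and data rows.

    Hypothetical helper: assumes one @attribute per line, a single @data
    marker, and comma-separated rows with no quoted commas.
    """
    attributes = []
    lines = iter(handle)

    # Header: stop as soon as the @data marker is seen.
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        keyword = parts[0].lower()
        if keyword == '@attribute' and len(parts) > 1:
            attributes.append(parts[1])
        elif keyword == '@data':
            break

    # Data: the same iterator continues right after @data.
    rows = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith('%'):
            rows.append(line.split(','))
    return attributes, rows

sample = io.StringIO(
    "@relation demo\n"
    "@attribute V1 {PURPLE,YELLOW}\n"
    "@attribute Class {1,2}\n"
    "@data\n"
    "YELLOW,2\n"
    "PURPLE,1\n"
)
attributes, rows = read_arff(sample)
print(attributes)
print(rows)
```

Because both loops share one iterator, the file is read only once and no seek is needed. For very large files, the second loop could yield rows one at a time instead of collecting them in a list.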
Answer 2 (score: 0)
Lines 5 through 9 are no longer part of the for loop, so "f" is not defined, I guess.