Say I have a file with the following contents:
/* Full name: abc */
.....
.....(.....)
.....(".....) ;
/* .....
/* .....
..... : "....."
}
"....., .....
Car : true ;
House : true ;
....
....
Age : 33
....
/* Full name: xyz */
....
....
Car : true ;
....
....
Age : 56
....
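I am only interested in each person's full name, car, house, and age. Between the variables/attributes I care about there are many other lines of data in different formats.
My code so far: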
import re
initial_val = {'House': 'false', 'Car': 'false'}
with open('input.txt') as f:
    records = []
    current_record = None
    for line in f:
        if not line.strip():
            continue
        elif current_record is None:
            people_name = re.search('.+Full name ?: (.+) ', line)
            if people_name:
                current_record = dict(initial_val, Name = people_name.group(1))
            else:
                continue
        elif current_record is not None:
            house = re.search(' *(House) ?: ?([a-z]+)', line)
            if house:
                current_record['House'] = house.group(2)
            car = re.search(' *(Car) ?: ?([a-z]+)', line)
            if car:
                current_record['Car'] = car.group(2)
            people_name = re.search('.+Full name ?: (.+) ', line)
            if people_name:
                records.append(current_record)
                current_record = dict(initial_val, Name = people_name.group(1))
print records
What I get:
[{'Name': 'abc', 'House': 'true', 'Car': 'true'}]
My question:
How can I extract the data and store it in a dictionary like this:
{'abc': {'Car': true, 'House': true, 'Age': 33}, 'xyz':{'Car': true, 'House': false, 'Age': 56}}
My goal:
Check whether each person has a car, a house, and an age, and return false if not.
I could then print it in a table like this:
Name Car House Age
abc true true 33
xyz true false 56
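Note that I am using Python 2.7, and I do not know in advance the actual value of each variable/attribute for each person (e.g. abc, true, true, 33).
What is the best solution to my problem? Thanks.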
Answer 0: (score: 1)
Well, you just have to keep track of the current record:
from __future__ import print_function  # needed for print(..., sep=...) on Python 2.7

def parse_name(line):
    # first remove the surrounding whitespace, then the initial '/* ' and final ' */'
    stripped_line = line.strip().strip('/* ')
    return stripped_line.split(':')[-1].strip()

WANTED_KEYS = ('Car', 'Age', 'House')

# default values for when the lines are not present for a record
INITIAL_VAL = {'Car': False, 'House': False, 'Age': -1}

with open('the_filename') as f:
    records = []
    current_record = None
    for line in f:
        if not line.strip():
            # skip empty lines
            continue
        elif current_record is None:
            # first record in the file
            if line.startswith('/*'):
                current_record = dict(INITIAL_VAL, name=parse_name(line))
            else:
                # this should probably be an error in the file contents
                continue
        elif line.startswith('/*'):
            # this means that the current record finished, and a new one is starting
            records.append(current_record)
            current_record = dict(INITIAL_VAL, name=parse_name(line))
        else:
            # partition never raises, even if the line has no ':' or several of them
            key, _, val = line.partition(':')
            if key.strip() in WANTED_KEYS:
                # we want to keep track of this field (minus any trailing ';')
                current_record[key.strip()] = val.strip().rstrip(';').strip()
            # otherwise just ignore the line
    if current_record is not None:
        # the last record is not followed by another header, so store it here
        records.append(current_record)

print('Name\tCar\tHouse\tAge')
for record in records:
    print(record['name'], record['Car'], record['House'], record['Age'], sep='\t')
Note that for Age you may want to convert the value to an integer with int:
if key.strip() == 'Age':
    current_record['Age'] = int(val.strip().rstrip(';'))
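The same idea can be extended to Car and House, so that they end up as real booleans rather than the strings read from the file. This is only a sketch: convert_value is a name I made up, and treating anything other than "true" as False is an assumption about your data.

```python
def convert_value(key, raw):
    """Turn the raw text after ':' into an int or a bool (hypothetical helper)."""
    # drop surrounding whitespace and an optional trailing ';'
    text = raw.strip().rstrip(';').strip()
    if key == 'Age':
        try:
            return int(text)
        except ValueError:
            # assumption: fall back to the same "missing" default of -1
            return -1
    # assumption: only the word "true" (any case) counts as True
    return text.lower() == 'true'


print(convert_value('Age', ' 33\n'))
print(convert_value('Car', ' true ;\n'))
print(convert_value('House', ' false ;'))
```

You would then store convert_value(key.strip(), val) instead of the raw string.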
The code above produces a list of dictionaries, but it is easy to convert it into a dictionary of dicts:
new_records = {r['name']: dict(r) for r in records}
for val in new_records.values():
    del val['name']
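After this, new_records will look like this (Car and House hold the strings read from the file, and Age is an integer if you convert it with int):
{'abc': {'Car': 'true', 'House': 'true', 'Age': 33}, ...}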
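If the dictionary of dicts is all you need, you can also build it directly and skip the intermediate list. Here is a rough sketch of that; parse_people and DEFAULTS are names I invented, the sample lines are abbreviated from the question, and I am assuming that only comments containing "Full name" start a new person.

```python
DEFAULTS = {'Car': False, 'House': False, 'Age': -1}


def parse_people(lines):
    """Return {name: {'Car': ..., 'House': ..., 'Age': ...}} (hypothetical helper)."""
    people = {}
    current = None
    for line in lines:
        text = line.strip()
        if text.startswith('/*') and 'Full name' in text:
            # header line: register the person right away, so the last
            # record is never lost at the end of the input
            name = text.strip('/* ').partition(':')[2].strip()
            current = dict(DEFAULTS)
            people[name] = current
        elif current is not None and ':' in text:
            key, _, val = text.partition(':')
            key = key.strip()
            val = val.strip().rstrip(';').strip()
            if key == 'Age' and val.isdigit():
                current[key] = int(val)
            elif key in ('Car', 'House'):
                current[key] = (val == 'true')
            # any other key is ignored
    return people


sample = """/* Full name: abc */
.....
/* .....
..... : "....."
Car : true ;
House : true ;
Age : 33
/* Full name: xyz */
Car : true ;
Age : 56
""".splitlines()

print(parse_people(sample))
```

For a real file you would pass the open file object instead of the sample list, since iterating over it yields the lines.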
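If there are other lines in different formats between the interesting ones, you can simply write a function that returns True or False depending on whether the line is in the format you need, and use it to filter the lines of the file: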
def is_interesting_line(line):
    if line.startswith('/*'):
        return True
    elif ':' in line:
        return True
    return False

for line in filter(is_interesting_line, f):
    # code as before
    pass
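To make that more concrete, here is one possible stricter predicate run against a few lines shaped like the ones in the question. The rules in it are my guesses (only "Full name" comments start a record, and only the three wanted keys matter), so adjust them to the real file.

```python
WANTED_KEYS = ('Car', 'House', 'Age')


def is_interesting_line(line):
    text = line.strip()
    if text.startswith('/*'):
        # assumption: other comments such as '/* .....' are just noise
        return 'Full name' in text
    # everything before the first ':' is the key; no ':' means no wanted key
    key = text.partition(':')[0].strip()
    return key in WANTED_KEYS


lines = [
    '/* Full name: abc */',
    '.....(".....) ;',
    '/* .....',
    '..... : "....."',
    '"....., .....',
    'Car : true ;',
    'Age : 33',
]

# list() makes the result the same on Python 2 and Python 3
kept = list(filter(is_interesting_line, lines))
print(kept)
```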
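Change is_interesting_line to suit your needs. Finally, if you have to handle several different formats and so on, it may be better to use regular expressions, in which case you could do something like this: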
import re

LINE_REGEX = re.compile(r'(/\*.*\*/)|(\w+\s*:.*)| <other stuff>')

def is_interesting_line(line):
    return LINE_REGEX.match(line) is not None
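A regular expression can also pull the values out while it checks the format, using named groups. The pattern below is only an illustration based on the sample in the question; classify is a made-up helper, and the exact spacing rules are assumptions.

```python
import re

# one alternative for the header comment, one for the wanted "key : value" lines
LINE_REGEX = re.compile(
    r'\s*(?:/\*\s*Full name\s*:\s*(?P<name>.+?)\s*\*/'
    r'|(?P<key>Car|House|Age)\s*:\s*(?P<value>\w+))'
)


def classify(line):
    """Return ('name', text), (key, value), or None for uninteresting lines."""
    match = LINE_REGEX.match(line)
    if match is None:
        return None
    if match.group('name') is not None:
        return ('name', match.group('name'))
    return (match.group('key'), match.group('value'))


print(classify('/* Full name: abc */'))
print(classify('Car : true ;'))
print(classify('..... : "....."'))
```

The loop from before then only has to look at the first element of the returned tuple to decide whether a new record starts or a field should be stored.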
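If you want, you can get nicer table formatting, but you would probably first need to determine the maximum length of the names and so on, or you could use something like tabulate to do that for you.
For example (untested):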
max_name_length = max(max(len(r['name']) for r in records), 4)
format_string = '{:<{}}\t{:<{}}\t{}\t{}'
print(format_string.format('Name', max_name_length, 'Car', 5, 'House', 'Age'))
for record in records:
    # str() keeps a default of False from being formatted as the integer 0
    print(format_string.format(record['name'], max_name_length, str(record['Car']), 5, record['House'], record['Age']))
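If you would rather not install anything, the width calculation can be done for every column with the standard library alone. This is a small sketch, not a replacement for tabulate; format_table and COLUMNS are hypothetical names, and the two records are the values from the question.

```python
# (header shown in the table, key used in each record)
COLUMNS = (('Name', 'name'), ('Car', 'Car'), ('House', 'House'), ('Age', 'Age'))


def format_table(records):
    """Return the records as left-aligned, space-padded text columns."""
    rows = [tuple(header for header, _ in COLUMNS)]
    for record in records:
        rows.append(tuple(str(record[key]) for _, key in COLUMNS))
    # each column is as wide as its longest cell, header included
    widths = [max(len(row[i]) for row in rows) for i in range(len(COLUMNS))]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append('  '.join(cells).rstrip())
    return '\n'.join(lines)


records = [
    {'name': 'abc', 'Car': True, 'House': True, 'Age': 33},
    {'name': 'xyz', 'Car': True, 'House': False, 'Age': 56},
]
print(format_table(records))
```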