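How do I search for several pieces of data across multiple lines and store them in a dictionary?

Asked: 2016-06-23 09:00:12

Tags: python regex python-2.7 dictionary

Say I have a file with the following contents: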

/* Full name: abc */
.....
.....(.....)
.....(".....) ;
/* .....
/* .....
..... : "....."
}
"....., .....
Car : true ;
House : true ;
....
....
Age : 33
....
/* Full name: xyz */
....
....
Car : true ;
....
....
Age : 56
....

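I am only interested in each person's full name, car, house, and age. Between the variables/attributes I care about there are many other lines of data in various formats.

My code so far: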

import re

initial_val = {'House': 'false', 'Car': 'false'}

with open('input.txt') as f:
    records = []
    current_record = None
    for line in f:
        if not line.strip():
            continue
        elif current_record is None:
            people_name = re.search('.+Full name ?: (.+) ', line)
            if people_name:
                current_record = dict(initial_val, Name = people_name.group(1))
            else:
                continue
        elif current_record is not None:
            house = re.search(' *(House) ?: ?([a-z]+)', line)
            if house:
                current_record['House'] = house.group(2)
            car = re.search(' *(Car) ?: ?([a-z]+)', line)
            if car:
                current_record['Car'] = car.group(2)
            people_name = re.search('.+Full name ?: (.+) ', line)
            if people_name:
                records.append(current_record)
                current_record = dict(initial_val, Name = people_name.group(1))                       

print records

What I get:

[{'Name': 'abc', 'House': 'true', 'Car': 'true'}]

My question:

How can I extract the data and store it in a dictionary like this:

{'abc': {'Car': true, 'House': true, 'Age': 33}, 'xyz':{'Car': true, 'House': false, 'Age': 56}}

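My goal:

Check whether each person has a car, a house, and an age, and return false for any that are missing.

Then I can print the result as a table like this: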

Name Car House Age
abc true true 33
xyz true false 56

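Note that I am using Python 2.7, and I do not know in advance what the actual value of each variable/attribute is for each person (e.g. abc, true, true, 33).

What is the best solution to my problem? Thanks.

1 answer:

Answer 0: (score: 1)

Well, you just need to keep track of the current record: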

from __future__ import print_function  # print() with sep= on Python 2.7


def parse_name(line):
    # drop the surrounding '/* ' and ' */', then keep the text after the colon
    stripped_line = line.strip().strip('/* ')
    return stripped_line.split(':')[-1].strip()


WANTED_KEYS = ('Car', 'Age', 'House')

# default values for when the lines are not present for a record
INITIAL_VAL = {'Car': False, 'House': False, 'Age': -1}

with open('the_filename') as f:
    records = []
    current_record = None

    for line in f:
        if not line.strip():
            # skip empty lines
            continue
        elif line.startswith('/* Full name'):
            # the previous record (if any) is finished and a new one is starting
            if current_record is not None:
                records.append(current_record)
            current_record = dict(INITIAL_VAL, name=parse_name(line))
        elif current_record is None:
            # data before the first name; probably an error in the file contents
            continue
        else:
            # partition never raises, even with no colon or several colons
            key, _, val = line.partition(':')
            if key.strip() in WANTED_KEYS:
                # we want to keep track of this field; drop any trailing ';'
                current_record[key.strip()] = val.strip().rstrip(';').strip()
            # otherwise just ignore the line

    # the last record is not followed by another name line, so add it here
    if current_record is not None:
        records.append(current_record)


print('Name\tCar\tHouse\tAge')
for record in records:
    print(record['name'], record['Car'], record['House'], record['Age'], sep='\t')

Note that for Age you may want to convert the value to an integer with int:

if key.strip() == 'Age':
    # strip whitespace and an optional trailing ';' before converting
    current_record['Age'] = int(val.strip().rstrip(';'))
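
The same idea extends to the true/false fields, which are otherwise kept as strings. A minimal sketch, assuming the booleans are always spelled true or false and that a trailing semicolon may be present; convert_value is a hypothetical helper name, not something from the code above:

```python
def convert_value(key, raw):
    """Turn the raw text after the colon into a Python value.

    Assumes 'Age' holds an integer and every other wanted key holds
    'true' or 'false'; surrounding whitespace and a trailing ';' are ignored.
    """
    text = raw.strip().rstrip(';').strip()
    if key == 'Age':
        return int(text)
    return text.lower() == 'true'
```

Inside the loop you would then store convert_value(key.strip(), val) instead of the raw string.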

The parsing loop above builds a list of dictionaries, but it is easy to convert that into a dictionary of dictionaries:

new_records = {r['name']: dict(r) for r in records}
for val in new_records.values():
    del val['name']

Afterwards new_records will be:

{'abc': {'Car': True, 'House': True, 'Age': 33}, ...}
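
If the nested dictionary is all you need, you could also build it directly while reading. The following is only a sketch under a few assumptions: a person starts at a line beginning with '/*' that contains 'Full name', missing fields default to False (as the question asks), and parse_people is a hypothetical function name. It takes any iterable of lines, so an open file works as well as a list.

```python
def parse_people(lines):
    """Return {name: {'Car': bool, 'House': bool, 'Age': int or False}}."""
    people = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith('/*') and 'Full name' in line:
            # A "Full name" comment starts a new person.
            name = line.strip('/* ').partition(':')[2].strip()
            current = {'Car': False, 'House': False, 'Age': False}
            people[name] = current
        elif current is not None:
            key, sep, value = line.partition(':')
            key = key.strip()
            # Only keep lines of the form "<wanted key> : <value>".
            if sep and key in current:
                value = value.strip(' ;')
                current[key] = int(value) if key == 'Age' else value == 'true'
    return people
```

Because every person is stored in the dictionary as soon as the name line is seen, the last record needs no special handling at the end of the file.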

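If there are other lines in different formats between the interesting ones, you can simply write a function that returns True or False depending on whether a line has the format you need, and use it to filter the lines of the file: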

def is_interesting_line(line):
    if line.startswith('/*'):
        return True
    elif ':' in line:
        return True
    return False

for line in filter(is_interesting_line, f):
    pass  # replace with the same loop body as before

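Change is_interesting_line to suit your needs. Finally, if you have to handle several different formats, a regular expression may be a better fit, in which case you could do something like: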

import re

LINE_REGEX = re.compile(r'(/\*.*\*/)|(\w+\s*:.*)| <other stuff>')

def is_interesting_line(line):
    return LINE_REGEX.match(line) is not None
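
To make this concrete, here is one possible pattern for the sample file in the question. It is a sketch, not the only reasonable choice: it leaves out the placeholder alternative and assumes that only "Full name" comments and the Car, House, and Age lines matter.

```python
import re

# Assumed pattern: a complete "Full name" comment, or one of the wanted keys.
LINE_REGEX = re.compile(
    r'\s*(/\*\s*Full name\s*:.*\*/|(?:Car|House|Age)\s*:.*)'
)


def is_interesting_line(line):
    return LINE_REGEX.match(line) is not None


sample = [
    '/* Full name: abc */',
    '.....(".....) ;',
    '/* .....',
    '..... : "....."',
    'Car : true ;',
    'Age : 33',
]
# Keeps only the name comment and the Car and Age lines.
kept = [line for line in sample if is_interesting_line(line)]
```

Unlike the plain startswith('/*') check, this rejects the unrelated comment lines and the lines that merely happen to contain a colon.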

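If you want, you can get nicer table formatting, but you will probably first need to determine the maximum name length and so on, or you can use something like tabulate to do that for you.

For example (untested):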

max_name_length = max(max(len(r['name']) for r in records), 4)
format_string = '{:<{}}\t{:<{}}\t{}\t{}'
print(format_string.format('Name', max_name_length, 'Car', 5, 'House', 'Age'))
for record in records:
    # str() keeps a default of False from being formatted as the integer 0
    print(format_string.format(record['name'], max_name_length, str(record['Car']), 5, record['House'], record['Age']))
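
If you would rather not hard-code any widths, a small helper can compute them for every column. This is a sketch using only the standard library; format_table is a hypothetical name, and it assumes each record is a dictionary with the keys name, Car, House, and Age.

```python
def format_table(records, columns=('name', 'Car', 'House', 'Age')):
    """Return the records as a left-aligned, space-padded text table."""
    header = ['Name' if column == 'name' else column for column in columns]
    rows = [header] + [[str(r[c]) for c in columns] for r in records]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(row[i]) for row in rows) for i in range(len(columns))]
    lines = [
        '  '.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    return '\n'.join(lines)
```

Calling print(format_table(records)) then prints the header followed by one line per person.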