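Parsing text structured with indentation in Python

Time: 2014-04-24 16:35:57

Tags: python parsing python-2.7 text

I'm trying to figure out an efficient way to parse some plain text that is structured by indentation (it comes from a Word doc). Example (note: the indentation below may not render on the mobile version of SO):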

Attendance records    8 F    1921-2010    Box 2
    1921-1927, 1932-1944
    1937-1939,1948-1966, 1971-1979, 1989-1994, 2010
Number of meetings attended each year    1 F    1991-1994    Box 2
Papers re: Safaris    10 F    1951-2011    Box 2
    Incomplete; Includes correspondence about beginning “Safaris” may also include announcements, invitations, reports, attendance, and charges; some photographs. See also: Correspondence and Minutes

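So the unindented text is the parent record data, and each group of indented lines beneath a parent line holds notes for that record (the notes themselves are also split across multiple lines).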
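To make that rule concrete, here is a minimal sketch of how a single line could be classified. The helper name `classify` and the sample lines are illustrative assumptions rather than part of the original script, and the sketch treats any leading whitespace (tab or spaces) as indentation.

```python
def classify(line):
    """Label a line as 'blank', 'note', or 'parent' (hypothetical helper)."""
    if not line.strip():
        return "blank"       # empty or whitespace-only line
    if line[0].isspace():
        return "note"        # indented: belongs to the record above it
    return "parent"          # unindented: starts a new record


# Sample lines with assumed spacing; the real file may differ.
sample = [
    "Attendance records    8 F    1921-2010    Box 2\n",
    "\t1921-1927, 1932-1944\n",
    "\n",
]

for line in sample:
    print(classify(line), repr(line))
```

Checking for blank lines first keeps `line[0]` safe and stops empty lines from being mistaken for notes.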

So far I have a rough script that parses the unindented parent lines, so that I get a list of dictionary items:

import re

with open('example_text.txt', 'r') as f:
    lines = f.readlines()

records = []

for line in lines:
    # Parent record lines start with a letter; their fields are
    # separated by runs of two or more whitespace characters.
    if line[0].isalpha():
        processed = re.split(r'\s{2,}', line.strip())

        title = processed[0]
        rec_id = processed[1]
        years = processed[2]
        location = processed[3]

        records.append({
            "title": title,
            "id": rec_id,
            "years": years,
            "location": location
        })

    elif not line[0].isalpha():
        print("These are the notes, but attaching them to the above records is not clear")

print(records)

This produces:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010'}, {'id': '1 F', 'location': 'Box 2', 'title': 'Number of meetings attended each year', 'years': '1991-1994'}, {'id': '10 F', 'location': 'Box 2', 'title': 'Papers re: Safaris', 'years': '1951-2011'}]

But now I want to add notes to each record, along these lines:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010', 'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010' }, ...]

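What confuses me is that I'm taking this procedural, line-by-line approach, and I'm not sure whether there is a more Pythonic way to do it. I'm more used to scraping web pages, where you at least have selectors; here it is hard to look back once you are walking through the text one line at a time. I'm hoping someone can shake my thinking loose and offer a fresh perspective on a better way to attack this problem.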

Update: simply adding the condition for the indented lines, as suggested in the answer below, worked fine:

import re
from pprint import pprint

with open('example_text.txt', 'r') as f:
    lines = f.readlines()

records = []

for line in lines:
    if not line[0].isalpha():
        # Indented (or blank) line: attach it to the most recent record.
        # Assumes the file starts with a record line, so `record` exists.
        record['notes'].append(line)
        continue

    # Unindented line: split the fields on runs of 2+ whitespace characters.
    processed = re.split(r'\s{2,}', line.strip())

    record = {"title": processed[0],
              "id": processed[1],
              "years": processed[2],
              "location": processed[3],
              "notes": []}
    records.append(record)

pprint(records)
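The loop above stores each note as a raw line, including its leading indentation and trailing newline, whereas the desired output shows the notes as a single string. One way to bridge that gap is sketched below; `join_notes` is a hypothetical helper, and the two note lines are an assumed split of the example text.

```python
def join_notes(note_lines):
    """Collapse raw note lines into one space-separated string (hypothetical helper)."""
    stripped = (line.strip() for line in note_lines)
    # Drop pieces that were blank lines so they do not add extra spaces.
    return " ".join(piece for piece in stripped if piece)


record = {
    "title": "Attendance records",
    "notes": [
        "\t1921-1927, 1932-1944\n",
        "\t1937-1939,1948-1966, 1971-1979, 1989-1994, 2010\n",
    ],
}

record["notes"] = join_notes(record["notes"])
print(record["notes"])
```

Running this once per record after the loop finishes would turn every `notes` list into the string form shown in the target output.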

1 Answer:

Answer 0: (score: 1)

Since you have already solved the parsing of the record lines, I will focus only on how to read each record:

records = []

with open('data.txt', 'r') as lines:
    for line in lines:
        if line.startswith('\t'):
            # Tab-indented line: a note for the most recent record.
            record['notes'].append(line[1:])
            continue
        # Any other line starts a new record.
        record = {'title': line, 'notes': []}
        records.append(record)

for record in records:
    print('Record is', record['title'])
    print('Notes are', record['notes'])
    print()
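For completeness, the grouping idea can be combined with the field splitting from the question. This is only a sketch under stated assumptions: `parse_records` is a hypothetical function name, indentation may be a tab or spaces, every record line has exactly four fields separated by two or more whitespace characters, and the sample text uses assumed spacing.

```python
import re


def parse_records(lines):
    """Group indented note lines under the preceding unindented record line."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            if records:  # ignore notes that appear before any record
                records[-1]["notes"].append(line.strip())
            continue
        # Assumption: fields are separated by two or more whitespace characters.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 4:
            raise ValueError("unexpected record line: %r" % line)
        title, rec_id, years, location = fields
        records.append({"title": title, "id": rec_id, "years": years,
                        "location": location, "notes": []})
    return records


# Sample input with assumed spacing and tab indentation.
sample = (
    "Attendance records    8 F    1921-2010    Box 2\n"
    "\t1921-1927, 1932-1944\n"
    "\t1937-1939,1948-1966, 1971-1979, 1989-1994, 2010\n"
    "Number of meetings attended each year    1 F    1991-1994    Box 2\n"
)

for record in parse_records(sample.splitlines(True)):
    print(record["title"], record["notes"])
```

Appending to `records[-1]` instead of a separate `record` variable avoids a `NameError` when the input happens to start with an indented line.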