Splitting unstructured multi-line text

Asked: 2017-05-01 14:47:41

Tags: python parsing

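I have a [semi-]structured text file with various headings. One particular heading, Recipients, can span multiple lines, and carriage returns and page breaks in the file may fall under that heading. I need to add the recipient values to a list.

The file might look like this: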

Id
1236547852012

Time
2017-05-01

Author 
mary jane (123654789)

Recipients

peter paul (987456789)

jane jackson (74125896)

Id

2017050145698

Time
2017-04-30

Author
jane jackson (74125896)

Recipients
peter paul (987456789)
\n\r
\n\r

janet jackson (74125896)

fran mckensie (85214796)
\n\r

walter wood (745896369)

Id

4569632587

Time
2017-04-29\n\r

Author 

mary jane (123654789)

Recipients

peter paul (987456789)

jane jackson (74125896)

For each message, my output needs to be a list of recipient IDs, and it needs to look like this:

[987456789, 74125896]

[987456789, 74125896, 85214796, 745896369]

[987456789, 74125896]

My code:

recipientList = []

with open(inputFile, 'r') as f:
    for line in f:
        if 'Recipients' in line:
            # did this b/c recipient id would be on same line
            lineparts = line.split(' ')
            if len(lineparts) == 3:
                recipient = line.strip()
                recipientId = recipient.split('(',1)[1].replace(')','').strip()
                recipientList.append(recipientId)

                nextRecip = next(f).strip()
                if nextRecip:
                    recipID = nextRecip.split('(',1)[1].replace(')','').strip()
                    recipientList.append(recipID)

                anotherRecip = next(f).strip()
                if anotherRecip:
                    recipID2 = anotherRecip.split('(',1)[1].replace(')','').strip()
                    recipientList.append(recipID2)

            if len(lineparts) == 2:  # recipient on following line
                nxtRecipient = next(f).strip()
                if nxtRecipient:
                    nxtRecipID = nxtRecipient.split('(',1)[1].replace(')','').strip()
                    recipientList.append(nxtRecipID)

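How can I keep capturing recipient IDs without repeatedly calling next(f)? I want to handle anywhere from 1 to n recipients, as well as page breaks that may fall under the heading, like this: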

Recipients peter paul (987456789)

jane jackson (74125896)

PAGE 2

walter woods(745896369)

...as well as an unknown number of carriage returns between recipients in the "Recipients" list. I hope this isn't too confusing.
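One possible way to avoid the repeated next(f) calls is to track whether the current line sits below a Recipients heading and collect IDs until another heading appears. The following is only a sketch under several assumptions that the question does not confirm: the only headings are Id, Time, Author and Recipients; every recipient ID is a run of digits in parentheses at the end of its line; and anything else under Recipients (blank lines, stray carriage returns, "PAGE 2") can be ignored. The names `parse_recipients`, `OTHER_HEADINGS` and `ID_PATTERN` are hypothetical.

```python
import re

# Assumption: these are the only headings besides "Recipients".
OTHER_HEADINGS = {"Id", "Time", "Author"}
# Assumption: an ID is a run of digits in parentheses at the end of the line.
ID_PATTERN = re.compile(r"\((\d+)\)\s*$")


def parse_recipients(lines):
    """Return one list of recipient IDs (as ints) per Recipients heading."""
    messages = []          # one inner list per message
    in_recipients = False  # True while below a "Recipients" heading

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines and bare carriage returns
        if line.startswith("Recipients"):
            # New message; the first recipient may share this line.
            in_recipients = True
            messages.append([])
        elif line in OTHER_HEADINGS:
            in_recipients = False
            continue
        if in_recipients:
            match = ID_PATTERN.search(line)
            if match:  # lines such as "PAGE 2" have no ID and are skipped
                messages[-1].append(int(match.group(1)))
    return messages


sample = """Recipients peter paul (987456789)

jane jackson (74125896)

PAGE 2

walter woods(745896369)

Id
4569632587
"""
print(parse_recipients(sample.splitlines()))
# prints [[987456789, 74125896, 745896369]]
```

Because the function only iterates over lines, an open file object can be passed directly, for example `with open(inputFile) as f: parse_recipients(f)`. Converting to int matches the unquoted output shown above but would drop leading zeros; keeping `match.group(1)` as a string avoids that.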

0 answers:

No answers yet