Question

我在泡菜中。

我在dd中搜索页眉和页脚。

当我找到标题时，我将其放入列表中 - 对于页脚也是如此。

但是我注意到有时在标题之前找到一个页脚，在这种情况下我检查标题列表是否为空，如果是，则忽略页脚。这很好。

   if not header:
       footer.append(myfooter)

然而，有时我会找到一个标题，然后是两个页脚，然后再一个标题。然后将这些添加到列表中 - 这不是我的目标。

我的目标是将页眉与页脚一起映射，只要它们直接在一个和另一个之后。因此，两个列表应始终具有相同的数量，或者标题列表应大于页脚列表。

如何相应地实现算法？

Answer 1

这是你想要的吗？我不确定'dd'是什么，我假装页眉和页脚是以特定方式开始的单行。您需要替换与页眉和页脚匹配的逻辑以适合您的用例。

import re

VALID_DOCUMENT_A = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1
header: 2
This is the second page.
footer: 2
"""

VALID_DOCUMENT_B = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1
header: 2
This is the second page.
footer: 2
header: 3
This third page has a header but no footer.
"""

INVALID_DOCUMENT_A = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1
This is the second page. Where's the header, though?
footer: 2
"""

INVALID_DOCUMENT_B = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1a
footer: 1b
This is the second page. Oops - two footers above.
footer: 2
"""

INVALID_DOCUMENT_C = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1
header: 2a
header: 2b
This is the second page. Oops - two headers above.
footer: 2
"""


def is_header(line):
    return re.match('^header:.*', line)


def is_footer(line):
    return re.match('^footer:.*', line)


def pair_headers_and_footers(text):
    lines = text.splitlines()

    stack = []
    for i, line in enumerate(lines, 1):
        if is_header(line):
            if stack:
                raise ValueError('Got unexpected header on line {}'.format(i))

            stack.append(line)

        elif is_footer(line):
            if not stack:
                raise ValueError('Got unexpected footer on line {}'.format(i))

            yield stack.pop(), line


if __name__ == '__main__':
    documents = [
        VALID_DOCUMENT_A,    # 2 headers, 2 footers
        VALID_DOCUMENT_B,    # 3 headers, 2 footers
        INVALID_DOCUMENT_A,  # missing header for page 1
        INVALID_DOCUMENT_B,  # multiple footers for page 1
        INVALID_DOCUMENT_C   # multiple headers for page 2
    ]

    for document in documents:
        try:
            print(list(pair_headers_and_footers(document)))
        except ValueError as e:
            print(e)

<强>输出

[('header: 1', 'footer: 1'), ('header: 2', 'footer: 2')]
[('header: 1', 'footer: 1'), ('header: 2', 'footer: 2')]
Got unexpected footer on line 7
Got unexpected footer on line 6
Got unexpected header on line 7

<强>附录

我应该在函数pair_headers_and_footers中添加，你可以替换：

lines = text.splitlines()

使用：

lines = (m.group(0).rstrip() for m in re.finditer('(.*\n|.+$)', text))

这可能有助于减少内存使用量，尤其是在处理大量文本时。通过这种修改，页眉和页脚配对的整个过程变得“懒惰”。

Answer 2

如果上次看到的项目是标题，则只向页面添加页脚。那是你想要做的吗？

映射两个不同的列表

2 个答案: