Question

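I'm fixing a broken lib from GitHub that I want to use.

I have "fixed" the problem locally, but I don't think it's a very clean approach...

I'm poking at the Internet Archive's WARC library, specifically the arc.py part (https://github.com/internetarchive/warc/blob/master/warc/arc.py).

Since the lib was written, the tools that produce ARC files have changed somewhat, so the built-in parser fails because it doesn't expect to see some metadata in the file.

My local fix looks like this: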

    if header.startswith("<arcmetadata"):
        while not header.endswith("</arcmetadata>\n"):
            header = self.fileobj.readline()
        header = self.fileobj.readline()
        header = self.fileobj.readline()

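I'm also not sure that calling readline() twice to drop the next two blank lines (each containing "\n") is the cleanest way to advance the file object.

Is this good Python? Or is there a better way?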

Answer 1

The code looks like a copy/paste error. There is nothing wrong with using .readline(); just document what you are doing:

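    # skip metadata
    if header.startswith("<arcmetadata"):
        # readline() returns "" at EOF, so stop there as well
        while header and not header.endswith("</arcmetadata>\n"):
            header = self.fileobj.readline()
        # NOTE: header is the "</arcmetadata>" line here, i.e. it is not blank,
        # so read one more line before skipping blanks
        header = self.fileobj.readline()

    # skip blank lines (but stop at EOF)
    while header and not header.strip():
        header = self.fileobj.readline()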
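
To try the loop outside the library, here is a minimal, self-contained sketch. It assumes Python 3 and an in-memory file; the helper name `skip_arc_metadata` and the sample ARC-like lines are made up for illustration and are not part of the warc library.

```python
import io


def skip_arc_metadata(fileobj, header):
    """Return the first non-blank line after an <arcmetadata> block.

    Hypothetical helper, not part of the warc library. `header` is the
    line that was already read from `fileobj`.
    """
    if header.startswith("<arcmetadata"):
        # Read up to the closing tag; readline() returns "" at EOF.
        while header and not header.endswith("</arcmetadata>\n"):
            header = fileobj.readline()
        # header is the closing-tag line now, so move past it.
        header = fileobj.readline()

    # Skip blank lines, again stopping at EOF.
    while header and not header.strip():
        header = fileobj.readline()
    return header


# Made-up sample data shaped like the situation in the question.
sample = io.StringIO(
    "<arcmetadata>\n"
    "<foo>bar</foo>\n"
    "</arcmetadata>\n"
    "\n"
    "\n"
    "http://example.com/ 127.0.0.1 20120101000000 text/html 10\n"
)
first = sample.readline()
result = skip_arc_metadata(sample, first)
```

Here `result` is the record line that follows the metadata and the blank lines. An empty string means the file ended first.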

By the way, if the file contains XML, then use an XML parser to parse it. Don't do it by hand.
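
As a rough illustration of the parser suggestion, the sketch below collects the metadata lines and hands them to `xml.etree.ElementTree` from the standard library. The helper name `read_arc_metadata`, the `software` element, and the sample content are invented for the example. Real ARC metadata may use namespaces or a different layout, so treat this as a sketch under those assumptions.

```python
import io
import xml.etree.ElementTree as ET


def read_arc_metadata(fileobj, header):
    """Collect an <arcmetadata> block and parse it as XML.

    Hypothetical helper; returns an Element, or None if the block is
    missing or never closed.
    """
    if not header.startswith("<arcmetadata"):
        return None
    lines = [header]
    while not lines[-1].endswith("</arcmetadata>\n"):
        line = fileobj.readline()
        if not line:  # EOF before the closing tag
            return None
        lines.append(line)
    return ET.fromstring("".join(lines))


# Made-up sample; the element name "software" is an assumption.
sample = io.StringIO(
    "<arcmetadata>\n"
    "<software>example-crawler</software>\n"
    "</arcmetadata>\n"
    "\n"
)
root = read_arc_metadata(sample, sample.readline())
software = root.findtext("software")
```

Besides skipping the block, this makes its contents available instead of discarding them.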

Answer 2

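While there is nothing wrong per se with what you are doing, it might be more semantic to write:

    next(self.fileobj, None)

The absence of a variable assignment signals that you are throwing the next line away.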
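
A small sketch of this idiom, assuming Python 3 and an in-memory file (the sample lines are made up). The second argument to `next()` is a default that is returned at end of file instead of raising `StopIteration`. One caveat: on Python 2 file objects, mixing `next()` with `readline()` is unreliable because iteration uses a read-ahead buffer, so there it is safer to stick to one of the two.

```python
import io

# Made-up data: closing tag, two blank lines, then the next record.
fileobj = io.StringIO("</arcmetadata>\n\n\nnext record\n")

header = fileobj.readline()   # the closing-tag line
next(fileobj, None)           # discard the first blank line
next(fileobj, None)           # discard the second blank line
header = fileobj.readline()   # the line after the blanks

# At end of file the default is returned instead of an exception.
at_eof = next(fileobj, "sentinel")
```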

Answer 3

itertools could be used here:

    from itertools import islice, dropwhile

    if header.startswith("<arcmetadata"):
        fileobj = dropwhile(lambda x: not x.endswith("</arcmetadata>\n"), fileobj)
        fileobj = islice(fileobj, 2, None)
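
A runnable version of the idea, assuming Python 3 and made-up sample lines. Note that `dropwhile` yields the closing-tag line itself as its first item, so `islice(..., 2, None)` skips that line plus one more; the count would need adjusting to match the number of lines to drop.

```python
import io
from itertools import dropwhile, islice

# Made-up data: metadata block, one blank line, then the next record.
fileobj = io.StringIO(
    "<arcmetadata>\n"
    "<foo>bar</foo>\n"
    "</arcmetadata>\n"
    "\n"
    "next record\n"
)

header = next(fileobj)
if header.startswith("<arcmetadata"):
    # Drop lines until the closing tag; that line is yielded first.
    lines = dropwhile(lambda x: not x.endswith("</arcmetadata>\n"), fileobj)
    # Skip the closing tag and the blank line after it.
    lines = islice(lines, 2, None)
    # Default "" mirrors readline() at EOF.
    header = next(lines, "")
```

Both wrappers are lazy, so nothing is read from the file until `next()` is called on the result.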

Python - Handling an nth-line skip with readlines()
