Question

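In Allen Downey's Think Python, Exercise 13-2 asks you to process any .txt file from gutenberg.org and skip the header information, which ends with the line beginning "Produced by". This is the solution the author gives: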

import string


def process_file(filename, skip_header):
    """Makes a dict that maps each word in a file to its count.

    filename: string
    skip_header: boolean, whether to skip the Gutenberg header

    returns: map from each word to the number of times it appears
    """
    res = {}

    # The with block closes the file even if an error occurs.
    with open(filename) as fp:
        if skip_header:
            skip_gutenberg_header(fp)

        for line in fp:
            process_line(line, res)

    return res

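def process_line(line, res):
    """Adds the words in line to the word counts in res.

    line: string
    res: map from each word to its count, modified in place
    """
    for word in line.split():
        # Normalize case and trim surrounding punctuation before counting.
        word = word.lower().strip(string.punctuation)
        if word.isalpha():
            res[word] = res.get(word, 0) + 1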


def skip_gutenberg_header(fp):
    """Reads from fp until it finds the line that ends the header.

    fp: open file object
    """
    for line in fp:
        if line.startswith('Produced by'):
            break

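I really don't understand the flow of execution in this code. Once the code starts reading the file with skip_gutenberg_header(fp), which contains "for line in fp:", it finds the line it needs and breaks. The next loop then picks up right where the break statement left off. But why? As I see it, there are two independent iterations here, each containing "for line in fp:", so shouldn't the second one start from the beginning?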

Answer 1

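No, it shouldn't restart from the beginning. An open file object maintains a file position indicator, which moves as you read from (or write to) the file. You can also move the position indicator with the file's .seek method and query it with the .tell method.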

So if you break out of a for line in fp: loop, another for line in fp: loop continues reading from the position where the first one stopped.
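A minimal sketch of that behavior, assuming a small temporary file as a stand-in for a Gutenberg text (the sample contents and variable names are made up for illustration):

```python
import os
import tempfile

# Hypothetical sample text standing in for a Gutenberg file.
sample = (
    "Title: Example\n"
    "Produced by Someone\n"
    "first line of text\n"
    "second line of text\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path) as fp:
    # First loop: consume lines up to and including the end of the header.
    for line in fp:
        if line.startswith("Produced by"):
            break

    # Second loop over the same file object: it resumes after the header.
    remaining = [line.rstrip("\n") for line in fp]

os.remove(path)
print(remaining)  # ['first line of text', 'second line of text']
```

Both loops share one file object, so they share one position in the file. Reopening the file, or calling fp.seek(0), is what would send the second loop back to the start.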

BTW, this behavior of files isn't specific to Python: all modern languages that inherit C's concept of streams and files work this way.

The .seek and .tell methods are mentioned briefly in the tutorial.
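As a rough illustration of .tell and .seek, here is a small sketch that uses a throwaway temporary file (the file contents are invented for the example). It reads with readline() rather than a for loop, because in text mode Python disables tell() while a for loop over the file is in progress:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("header\nbody line\n")
    path = tmp.name

with open(path) as fp:
    start = fp.tell()          # position before anything has been read
    first = fp.readline()      # reading moves the position forward
    after_first = fp.tell()    # opaque value; only meaningful to seek()
    second = fp.readline()

    fp.seek(start)             # jump back to the saved position
    again = fp.readline()      # the first line is read a second time

os.remove(path)
print(repr(first), repr(second), repr(again))
```

In text mode the numbers returned by tell() should be treated as opaque cookies: save one and pass it back to seek(), but don't do arithmetic on it.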

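For a more in-depth treatment of file/stream handling in Python, see the documentation for the io module. There's a lot of information in that document, and some of it is mainly intended for advanced coders. You'll probably need to read it several times and write some test programs to absorb what it says, so feel free to skim it the first time you read it, or even the first few times. ;)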

Answer 2

As I see it, there are two independent iterations here, each containing "for line in fp:", so shouldn't the second one start from the beginning?

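If fp were a list, then of course they would. However, it isn't; it is only an iterable, and one that acts as its own iterator. In this case it is a file-like object with methods such as seek, tell, and read. File-like objects keep state: when you read a line from one, the position of the read pointer in the file changes, so the next read starts on the following line.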
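The contrast can be sketched without any file at all. The example below uses a made-up list of strings, then wraps the same list in iter() to get a stateful iterator, which behaves like the file object in this respect:

```python
# Hypothetical data: two header lines followed by two body lines.
items = ["header", "end of header", "body 1", "body 2"]

# Looping over a list: every for loop asks the list for a fresh iterator,
# so the second loop starts from the beginning again.
for item in items:
    if item == "end of header":
        break
from_list = [item for item in items]

# Looping over an iterator: it remembers how far it has advanced,
# so the second loop continues after the break.
it = iter(items)
for item in it:
    if item == "end of header":
        break
from_iterator = [item for item in it]

print(from_list)      # ['header', 'end of header', 'body 1', 'body 2']
print(from_iterator)  # ['body 1', 'body 2']
```

A file object is in the second category: iter(fp) returns fp itself, so every loop over it draws from the same underlying position.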

This is commonly used to skip the header row of tabular data (at least when you aren't using csv.reader):

with open("/path/to/file") as f:
    headers = next(f).strip()  # first line
    for line in f:
        # iterate by-line for the rest of the file
        ...
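The same next() idiom also works on a csv.reader, which is an iterator too. A small sketch, assuming invented in-memory data in place of a real file:

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for an open file.
data = io.StringIO("word,count\nwhale,3\nsea,5\n")

reader = csv.reader(data)
headers = next(reader)           # consume the header row
rows = [row for row in reader]   # iteration resumes with the data rows

print(headers)  # ['word', 'count']
print(rows)     # [['whale', '3'], ['sea', '5']]
```

Note that csv.reader yields every field as a string, so the counts here would still need converting with int() before doing arithmetic on them.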
