Question

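I downloaded a book in plain-text format from gutenberg.org. I'm trying to read in the text but skip the beginning part of the file, then use a process function I wrote to parse the rest. How can I do this?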

Here is the beginning of the text file.

> The Project Gutenberg EBook of The Kama Sutra of Vatsyayana, by Vatsyayana

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: The Kama Sutra of Vatsyayana
       Translated From The Sanscrit In Seven Parts With Preface,
       Introduction and Concluding Remarks

Author: Vatsyayana

Translator: Richard Burton
            Bhagavanlal Indrajit
            Shivaram Parashuram Bhide

Release Date: January 18, 2009 [EBook #27827]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA ***




Produced by Bruce Albrecht, Carla Foust, Jon Noring and
the Online Distributed Proofreading Team at
http://www.pgdp.net

And here is my current code, which processes the entire file.

import string

def process_file(filename):
    """Open a file and return a dict mapping each word to its count."""
    h = dict()
    fin = open(filename)
    for line in fin:
        process_line(line, h)
    return h

def process_line(line, h):
    line = line.replace('-', ' ')

    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()

        h[word] = h.get(word,0)+1

Answer 1

Add:

for line in fin:
    if "START OF THIS PROJECT GUTENBERG EBOOK" in line:
        break

just before your own "for line in fin:" loop.
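
This works because a file object is its own iterator: the second loop picks up where the first one stopped instead of starting over. Here is a minimal sketch of that behaviour, assuming io.StringIO as a stand-in for the opened file; the sample text is made up for illustration.

```python
import io

# io.StringIO stands in for the object returned by open(); the text is invented.
fin = io.StringIO(
    "some header line\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "first body line\n"
    "second body line\n"
)

# First loop: consume lines up to and including the marker line.
for line in fin:
    if "START OF THIS PROJECT GUTENBERG EBOOK" in line:
        break

# Second loop: resumes at the line right after the marker.
body = [line.rstrip("\n") for line in fin]
print(body)
```

If the marker never appears, the first loop consumes the whole file and the second loop sees nothing.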

Answer 2

Well, you can simply read the input until your criterion for skipping the beginning is met:

def process_file(filename):
    """Open a file and return a dict mapping each word to its count."""
    h = dict()
    with open(filename) as fin:
        # Consume lines up to and including the start marker.
        for line in fin:
            if line.rstrip() == "*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA ***":
                break

        # The same file iterator continues from the line after the marker.
        for line in fin:
            process_line(line, h)

    return h

Note that I used line.rstrip() == "*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA ***" as the criterion in this example, but you are free to set your own.
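
As one possible alternative criterion, here is a rough sketch that matches any line beginning with "*** START OF", so the book title does not need to be hard-coded. The helper name lines_after_marker and its marker parameter are hypothetical, not part of the code above, and the sample text is invented.

```python
import io

def lines_after_marker(lines, marker="*** START OF"):
    """Return an iterator over the lines following the first line that
    starts with marker. If no line matches, the iterator is empty.
    (Hypothetical helper, for illustration only.)"""
    it = iter(lines)
    for line in it:
        if line.startswith(marker):
            break
    return it

# Invented sample input; a real file object from open() works the same way.
sample = io.StringIO(
    "Title: Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Produced by someone\n"
    "Chapter 1\n"
)
remaining = [line.rstrip("\n") for line in lines_after_marker(sample)]
print(remaining)
```

Inside process_file, the second loop could then read "for line in lines_after_marker(fin):" and pass each line to process_line as before.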

Read in a file and skip the header portion of a text file in Python
