Question

I created a class that extends SGMLParser:

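# sgmllib ships with Python 2 only
from sgmllib import SGMLParser

class URLLister(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)

    # Called when a <title> start tag is encountered
    def start_title(self, attrs):
        pass

    # Called for every run of plain text, whatever tag encloses it
    def handle_data(self, data):
        print data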

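Very simple code. As I understand it, start_title is called when a <title> tag is encountered, and handle_data is called when plain text is encountered. Now I want to extract the text between <title> and </title>, for example: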

<html><head><title>Webpage title</title></head><body>Simple text</body></html>

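I want to print only Webpage title, the text between the <title> tags, but with this handle_data I print all of the plain text, including both Webpage title and Simple text. How can I simply output only the text between the <title> tags?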

Answer 1

Actually, you can add a hard-coded check to handle_data, like this:

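def handle_data(self, data):
    # Raw text of the most recently opened start tag, e.g. "<title>"; None before any tag
    tag = (self.get_starttag_text() or "").replace("<", "").replace(">", "")
    tag_words = tag.split(" ")
    # The first word is the tag name; compare exactly so "subtitle" does not match
    if tag_words[0].lower() == "title":
        print data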

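I'm not sure this is exactly what you want, and I'm sure there is a more elegant answer. Note that this check looks only at the last start tag opened, so any text that follows </title> before the next start tag would also be printed.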
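One possibly cleaner option is to keep a flag that is switched on at <title> and off at </title>, and to collect text only while it is set. The sketch below is not the SGMLParser code from the question: sgmllib was removed in Python 3, so it swaps in html.parser.HTMLParser from the standard library. The names TitleExtractor, in_title and title_parts are made up for illustration. With SGMLParser the same idea should work by setting the flag in start_title(attrs) and clearing it in end_title().

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects only the text found between <title> and </title>.

    Hypothetical helper class, not part of any library.
    """

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False   # True while the parser is inside <title>
        self.title_parts = []   # text chunks seen inside <title>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text can arrive in several chunks, so accumulate instead of printing.
        if self.in_title:
            self.title_parts.append(data)

    @property
    def title(self):
        return "".join(self.title_parts)


page = ("<html><head><title>Webpage title</title></head>"
        "<body>Simple text</body></html>")
parser = TitleExtractor()
parser.feed(page)
parser.close()
print(parser.title)  # Webpage title
```

Because the flag is cleared at the end tag, Simple text in the body is never collected, and no string surgery on the raw start tag is needed.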

How to extract specific text from HTML using SGMLParser
