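I am trying to parse HTML that contains elements with class="link". I want to read each line of the file into a variable and then parse it, but the parser example seems to need a triple-quoted string. How can I create a string variable in the triple-quote style? Thanks.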
from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag

    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag

    def handle_data(self, data):
        print "Encountered some data :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
var = open('./index.html', 'r')
strings = var.read()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
OK, so if I read the content from a local file, how do I parse the string in var?
index.html:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<title>Document</title>
</head>
<body>
<div class="row">
<h1>hello world</h1>
<div class="row">
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Id, excepturi, consequatur sed nobis facere veritatis tempore qui ipsum enim dignissimos!</p>
</div>
</div>
</body>
</html>
If I read this HTML in as one big string, how do I parse it? I only want to get the content of the h1 tag. Thanks for your time.
Answer 0 (score: 0)
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # True while the parser is inside an <h1> element
        self.in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # print text only while inside <h1>
        if self.in_h1:
            print data
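
To tie this back to the question, here is a minimal sketch that collects the h1 text into a list instead of printing it. The names H1Collector and h1_texts are hypothetical helpers made up for this illustration, and the import fallback is an assumption so that the same code runs on both Python 2 and Python 3. It also shows that no triple quotes are needed for file content: a triple-quoted literal and the value returned by read() are both ordinary strings, and feed() accepts either.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class H1Collector(HTMLParser):
    """Collects the text found inside <h1> elements (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_h1 = False   # True while inside an <h1> element
        self.headings = []   # one entry per <h1> seen so far

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True
            self.headings.append('')

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # Append text to the current heading only while inside <h1>
        if self.in_h1:
            self.headings[-1] += data


def h1_texts(html):
    """Return a list with the text of every <h1> in the given HTML string."""
    parser = H1Collector()
    parser.feed(html)
    parser.close()
    return parser.headings


# A triple-quoted literal is just a convenient way to write a multi-line string.
sample = """<html><body>
<div class="row"><h1>hello world</h1><p>Lorem ipsum</p></div>
</body></html>"""
print(h1_texts(sample))
```

For the local file from the question, something like `with open('./index.html') as f: print(h1_texts(f.read()))` should behave the same way, since f.read() returns the whole file as a single string.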