I am using HTMLParser to count the h2 tags on http://www.worldgolf.com/courses/usa/massachusetts/
Here is the code:
from HTMLParser import HTMLParser
import urllib2

class City2Parser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters.
        if tag == 'h2':
            print 'h2'

req = urllib2.Request('http://www.worldgolf.com/courses/usa/massachusetts/')
html = urllib2.urlopen(req)
parser = City2Parser()
parser.feed(html.read())
It prints only once. Why? The page clearly has three h2 tags.
Answer 0 (score: 1)
Look at what happens:
>>> from HTMLParser import HTMLParser
>>> import urllib2
>>> class City2Parser(HTMLParser):
...     def handle_starttag(self,tag,attrs):
...         if tag == 'h2':
...             print 'h2'
...
>>> req = urllib2.Request('http://www.worldgolf.com/courses/usa/massachusetts/')
>>> html = urllib2.urlopen(req)
>>> parser = City2Parser()
>>> parser.feed(html.read())
h2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 109, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 151, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 232, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 307, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.7/HTMLParser.py", line 116, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 249, column 30
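It is complaining that the HTML <br style="clear:left;"
is invalid. HTMLParser insists on getting valid HTML, so the exception aborts the parse right after the first h2 has been printed.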
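Answer 1 (score: 1)
You would have to implement a bunch of handlers in your City2Parser to cope with the mess of tags and JavaScript that HTMLParser does not seem to handle out of the box. Why not use something like BeautifulSoup instead: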
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.worldgolf.com/courses/usa/massachusetts/')
soup = BeautifulSoup(page)
s = soup.findAll('h2')
print len(s)
for t in s:
    print t.text
This gives:
3
Featured Massachusetts Golf Course
Golf Locations
Latest user ratings for Massachusetts golf courses
Unless the whole point is to use HTMLParser.
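If HTMLParser really is the point, here is a minimal sketch of collecting the h2 text with it. It assumes Python 3, where the module lives at html.parser and HTMLParseError no longer exists (it was removed in Python 3.5). The class name H2Collector and the inline SAMPLE string are made up for illustration; SAMPLE only stands in for the downloaded page, and it is well-formed, so this does not show how the parser treats the broken tag on the real page.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text of every <h2> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []    # finished heading strings
        self._buffer = None   # text chunks while inside an <h2>, else None

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears between <h2> and </h2>.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2' and self._buffer is not None:
            self.headings.append(''.join(self._buffer).strip())
            self._buffer = None


# Made-up sample standing in for the real page (assumption, not the site's HTML).
SAMPLE = """
<html><body>
<h2>Featured Massachusetts Golf Course</h2>
<p>some text</p>
<h2>Golf Locations</h2>
<br style="clear:left;">
<h2>Latest user ratings for Massachusetts golf courses</h2>
</body></html>
"""

parser = H2Collector()
parser.feed(SAMPLE)
parser.close()

print(len(parser.headings))
for heading in parser.headings:
    print(heading)
```

On this sample it should report three headings, mirroring the BeautifulSoup output above. Keeping the results in a list rather than printing inside the handler makes it easy to count them or reuse them later.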