Question

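I am trying to parse a specific set of links from an HTML file, but because I am using HTMLParser I cannot access the HTML's hierarchy tree, so I have not been able to extract the information.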

My HTML looks like this:

<p class="mediatitle">
        <a class="bullet medialink" href="link/to/a/file">Some Content
        </a>
</p>

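So what I need is to extract every value whose key is 'href' and whose preceding attribute is class="bullet medialink". In other words, I only want the hrefs of tags that have the class 'bullet medialink'.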

What I have tried so far is:

from HTMLParser import HTMLParser
import urllib
# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (key, value) in attrs:
                if value == 'bullet medialink':
                    print "attr:", key

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()

Answer 1

I would go with bs4. bs4 (Beautiful Soup 4) is a third-party HTML parser. Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

import urllib
from bs4 import BeautifulSoup

f = urllib.urlopen("sample.html")
html = f.read()
soup = BeautifulSoup(html)
for atag in soup.select('.bullet.medialink'):  # Just enter a css-selector here
    print atag['href']  # You can also get an attribute with atag.get('href')

Or shorter:

import urllib
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen("sample.html").read())
for atag in soup.select('.bullet.medialink'):
    print atag['href']

Answer 2

So I ended up doing it with a simple boolean flag, because HTMLParser is not a hierarchical parser package.

Here is the code:

from HTMLParser import HTMLParser
import urllib
# create a subclass and override the handler methods
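class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # flag is set once class="bullet medialink" is seen; this relies on
            # the class attribute appearing before href in the tag
            flag = 0
            for (key, value) in attrs:
                if value == 'bullet medialink' and key == 'class':
                    flag = 1
                if key == 'href' and flag == 1:
                    print "link : ", value
                    flag = 0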

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()

Hopefully someone comes up with a more elegant solution.
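
One possibly tidier variant of the flag approach is to turn the attrs list into a dict, so the check no longer depends on class appearing before href. This is only a sketch, written for Python 3 (where the module is html.parser rather than HTMLParser). The names MediaLinkParser, links, and sample are made up for illustration, and it feeds an inline string instead of reading sample.html.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class MediaLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry both target classes."""

    def __init__(self):
        super().__init__()
        self.links = []  # hypothetical attribute holding the matches

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; a dict makes lookups
        # independent of the order the attributes were written in.
        attr_map = dict(attrs)
        classes = (attr_map.get('class') or '').split()
        # Compare class tokens as a set so their order does not matter either.
        if {'bullet', 'medialink'} <= set(classes) and attr_map.get('href'):
            self.links.append(attr_map['href'])


# Illustrative input: href deliberately comes before class in the first link.
sample = '''
<p class="mediatitle">
    <a href="link/to/a/file" class="bullet medialink">Some Content</a>
    <a class="other" href="skip/me">Other</a>
</p>
'''

parser = MediaLinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

Collecting the matches in a list rather than printing inside the handler keeps the parser reusable, and the set comparison also accepts class="medialink bullet".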

Parsing specific links in HTML using HTMLParser in Python?
