Python - How to properly use HTMLParser on alexa.com

Time: 2015-03-29 20:13:13

Tags: python django python-2.7

I'm trying to get the top 20 sites from www.alexa.com/topsites/global in my application, but I'm not getting the result I expect.

My code so far, using HTMLParser and urllib2:

import HTMLParser, urllib2

class MyHTMLParser(HTMLParser.HTMLParser):
    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self.in_a = False
        self.next_link_text_pair = None
    def handle_starttag(self, tag, attrs):
        if tag=='a':
            for name, value in attrs:
                if name=='href':
                    self.next_link_text_pair = [value, '']
                    self.in_a = True
                    break
    def handle_data(self, data):
        if self.in_a: self.next_link_text_pair[1] += data
    def handle_endtag(self, tag):
        if tag=='a':
            if self.next_link_text_pair is not None:
                print self.next_link_text_pair
            self.next_link_text_pair = None
            self.in_a = False

if __name__=='__main__':
    p = MyHTMLParser()
    p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())

The result I get:

['/', '']
['/topsites', 'Browse Top Sites']
['/', 'Home']
['/plans', 'Plans and Pricing']
['/tools', 'Tools']
['/pro/dashboard', 'My Dashboard']
['/toolbar', 'Toolbar']
['/about', 'About Us']
['/support', 'Support']
['http://blog.alexa.com', 'Blog']
['/secure/login?resource=%2Ftopsites%2Fglobal', 'Sign In']
['/register?resource=%2Ftopsites%2Fglobal', 'Create an Account']
['/topsites/countries', 'By Country']
['/topsites/category', 'By Category']
['/siteinfo/google.com', 'Google.com']
['/siteinfo/facebook.com', 'Facebook.com']
['/siteinfo/youtube.com', 'Youtube.com']
['/siteinfo/baidu.com', 'Baidu.com']
['/siteinfo/yahoo.com', 'Yahoo.com']
['/siteinfo/wikipedia.org', 'Wikipedia.org']
['/siteinfo/amazon.com', 'Amazon.com']
['/siteinfo/twitter.com', 'Twitter.com']
['/siteinfo/taobao.com', 'Taobao.com']
['/siteinfo/qq.com', 'Qq.com']
['/siteinfo/google.co.in', 'Google.co.in']
['/siteinfo/linkedin.com', 'Linkedin.com']

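I want to get rid of the unwanted results at the top, such as Home, Plans and Pricing, and so on, and get only the names of the top 20 sites, without the [...] part.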

Can someone help me? I don't want to use BeautifulSoup.

1 answer:

Answer 0: (score: 1)

You can check whether the URL starts with /siteinfo/ to filter out the irrelevant links:

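# In handle_endtag, in place of the existing "is not None" check and print:
if self.next_link_text_pair is not None:
    # Keep only links to site pages and print just the link text (the site name)
    if self.next_link_text_pair[0].startswith('/siteinfo/'):
        print self.next_link_text_pair[1]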
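
For reference, here is a minimal self-contained sketch of the same startswith check built into a parser that collects the names into a list instead of printing them. The class name SiteInfoParser, the limit parameter, and the sample_html string are made up for illustration; the sample only mimics the link structure seen in the output above and is not the real Alexa page. The import fallback is there so the sketch runs on both Python 2.7 and Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2.7, as in the question
except ImportError:
    from html.parser import HTMLParser  # Python 3


class SiteInfoParser(HTMLParser):
    """Collect the text of <a> tags whose href starts with a given prefix."""

    def __init__(self, prefix='/siteinfo/', limit=20):
        HTMLParser.__init__(self)
        self.prefix = prefix
        self.limit = limit      # hypothetical cap on how many names to keep
        self.sites = []
        self._text = None       # None means "not inside a matching <a>"

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and len(self.sites) < self.limit:
            # attrs is a list of (name, value) pairs; value may be None
            href = dict(attrs).get('href') or ''
            if href.startswith(self.prefix):
                self._text = ''

    def handle_data(self, data):
        if self._text is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._text is not None:
            self.sites.append(self._text.strip())
            self._text = None


# Made-up sample markup (assumption), shaped like the links in the question
sample_html = '''
<a href="/">Home</a>
<a href="/plans">Plans and Pricing</a>
<a href="/siteinfo/google.com">Google.com</a>
<a href="/siteinfo/facebook.com">Facebook.com</a>
'''

parser = SiteInfoParser()
parser.feed(sample_html)
parser.close()
print(parser.sites)
```

To run it against the live page, you would feed it the string returned by urllib2.urlopen(...).read() as in the question, assuming the page still uses /siteinfo/ links for the site entries.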