So, I am trying to get the top 20 sites from www.alexa.com/topsites/global in my application, but I am not getting the expected result.
So far, my code uses HTMLParser and urllib2:
import HTMLParser, urllib2

class MyHTMLParser(HTMLParser.HTMLParser):
    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self.in_a = False
        self.next_link_text_pair = None
    def handle_starttag(self, tag, attrs):
        if tag=='a':
            for name, value in attrs:
                if name=='href':
                    self.next_link_text_pair = [value, '']
                    self.in_a = True
                    break
    def handle_data(self, data):
        if self.in_a: self.next_link_text_pair[1] += data
    def handle_endtag(self, tag):
        if tag=='a':
            if self.next_link_text_pair is not None:
                print self.next_link_text_pair
            self.next_link_text_pair = None
            self.in_a = False

if __name__=='__main__':
    p = MyHTMLParser()
    p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())
The result I get:
['/', '']
['/topsites', 'Browse Top Sites']
['/', 'Home']
['/plans', 'Plans and Pricing']
['/tools', 'Tools']
['/pro/dashboard', 'My Dashboard']
['/toolbar', 'Toolbar']
['/about', 'About Us']
['/support', 'Support']
['http://blog.alexa.com', 'Blog']
['/secure/login?resource=%2Ftopsites%2Fglobal', 'Sign In']
['/register?resource=%2Ftopsites%2Fglobal', 'Create an Account']
['/topsites/countries', 'By Country']
['/topsites/category', 'By Category']
['/siteinfo/google.com', 'Google.com']
['/siteinfo/facebook.com', 'Facebook.com']
['/siteinfo/youtube.com', 'Youtube.com']
['/siteinfo/baidu.com', 'Baidu.com']
['/siteinfo/yahoo.com', 'Yahoo.com']
['/siteinfo/wikipedia.org', 'Wikipedia.org']
['/siteinfo/amazon.com', 'Amazon.com']
['/siteinfo/twitter.com', 'Twitter.com']
['/siteinfo/taobao.com', 'Taobao.com']
['/siteinfo/qq.com', 'Qq.com']
['/siteinfo/google.co.in', 'Google.co.in']
['/siteinfo/linkedin.com', 'Linkedin.com']
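I want to get rid of the unwanted results at the start, such as Home and Plans and Pricing, and get only the names of the top 20 sites, without the accompanying link paths.
Can someone help me? I would prefer not to use BeautifulSoup.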
Answer 0 (score: 1)
You can check whether the URL starts with /siteinfo/ to filter out the irrelevant entries:
if self.next_link_text_pair is not None:
    if self.next_link_text_pair[0].startswith('/siteinfo/'):
        print self.next_link_text_pair[1]
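
For completeness, here is a rough sketch of the same idea as a self-contained parser that also stops after 20 entries. It is written for Python 3 (html.parser instead of the Python 2 HTMLParser module used in the question), and the names TopSitesParser, top_sites and SAMPLE_HTML are made up for illustration. The sample markup is a hypothetical stand-in for the downloaded page, so treat this as a starting point rather than a tested scraper for the live site.

```python
from html.parser import HTMLParser


class TopSitesParser(HTMLParser):
    """Collect the text of <a> tags whose href starts with a given prefix."""

    def __init__(self, prefix='/siteinfo/', limit=20):
        super().__init__()
        self.prefix = prefix
        self.limit = limit
        self.sites = []
        self._text = None  # text chunks of the matching <a> being read, else None

    def handle_starttag(self, tag, attrs):
        # Only start collecting while fewer than `limit` sites have been stored.
        if tag == 'a' and len(self.sites) < self.limit:
            href = dict(attrs).get('href') or ''
            if href.startswith(self.prefix):
                self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._text is not None:
            self.sites.append(''.join(self._text).strip())
            self._text = None


def top_sites(html, limit=20):
    """Return up to `limit` link texts whose href starts with /siteinfo/."""
    parser = TopSitesParser(limit=limit)
    parser.feed(html)
    parser.close()
    return parser.sites


# Hypothetical sample markup standing in for the real page content.
SAMPLE_HTML = (
    '<a href="/">Home</a>'
    '<a href="/plans">Plans and Pricing</a>'
    '<a href="/siteinfo/google.com">Google.com</a>'
    '<a href="/siteinfo/facebook.com">Facebook.com</a>'
)

print(top_sites(SAMPLE_HTML))
```

To use it on the real page you would pass the decoded HTML of the response to top_sites instead of SAMPLE_HTML. Keeping the names in a list rather than printing them inside handle_endtag makes it easy to cap the result at 20 and to reuse it elsewhere in the application.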