How do I extract URL links when a blog has multiple pages?

Asked: 2015-06-24 02:57:33

Tags: python web-scraping

I am trying to extract the URLs of selected posts from a Blogspot blog.

Unfortunately, when I run my code it only extracts the URLs from the first page and ignores the other pages.

For example, http://ellywonderland.blogspot.com/ has 9 post URLs spread across 2 pages, but the code below only extracts the 7 URLs on the first page and misses the ones on the next page.

Code:

import urllib2  # Python 2 only; on Python 3 use urllib.request (or the requests library)
from bs4 import BeautifulSoup

# read the list of blog URLs, one per line
links_file = open("link.txt")
links = links_file.readlines()


for url in links:
    # strip the trailing newline that readlines() keeps
    htmltext = urllib2.urlopen(url.strip()).read()
    soup = BeautifulSoup(htmltext, "html.parser")

    # each post title is an <h3 class="post-title entry-title"> wrapping a link
    post_title_tags = soup.find_all("h3", "post-title entry-title")

    for tag in post_title_tags:
        link = tag.find("a")
        print link.get("href")

'link.txt' contains a list of URLs, for example:

http://ellywonderland.blogspot.com/

Output:

 http://ellywonderland.blogspot.com/2011/03/my-vintage-pre-wedding.html
 http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html
 http://ellywonderland.blogspot.com/2010/12/tissue-paper-flower-crepe-paper.html
 http://ellywonderland.blogspot.com/2010/12/menguap-menurut-islam.html
 http://ellywonderland.blogspot.com/2010/12/weddings-idea.html
 http://ellywonderland.blogspot.com/2010/12/kawin.html
 http://ellywonderland.blogspot.com/2010/11/vitamin-c-collagen.html

So, could you help me read all the pages of a Blogspot blog?

Thank you.
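Not an authoritative answer, but one common approach: Blogspot's list pages typically expose an "Older Posts" pager link (by convention an `<a>` element with class `blog-pager-older-link`), so a crawler can keep following that link until it disappears. A minimal sketch of the link-finding step, using only the standard library (the class name and the sample URL are assumptions based on typical Blogspot templates):

```python
from html.parser import HTMLParser


class OlderLinkFinder(HTMLParser):
    """Record the href of the first <a> whose class contains blog-pager-older-link."""

    def __init__(self):
        super().__init__()
        self.older_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "blog-pager-older-link" in attrs.get("class", ""):
            if self.older_url is None:
                self.older_url = attrs.get("href")


def find_older_link(html):
    """Return the 'Older Posts' URL from a page's HTML, or None if absent."""
    finder = OlderLinkFinder()
    finder.feed(html)
    return finder.older_url


# Sketch of the crawl loop: fetch a page, extract its post links as in the
# question's code, then move on to the URL returned by find_older_link(),
# stopping when it returns None (i.e. the last page has been reached).
```

With this helper, the question's `for url in links` loop would become an inner `while url:` loop that re-assigns `url = find_older_link(htmltext)` after each page.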

0 Answers