I am trying to extract the URL links of selected sections from a Blogspot blog.
Unfortunately, when I run my code, it only extracts the URLs on the first page and ignores the other pages.
For example, http://ellywonderland.blogspot.com/
has 9 URL links spread over 2 pages, but the code only extracts the 7 URL links on the first page and ignores the remaining links on the next page.
Code:
import urllib2  # Please look into the requests library instead
from bs4 import BeautifulSoup

# A descriptive variable name makes the intent clear
links_file = open("link.txt")
# No need to split the contents yourself; readlines() gives one URL per line
links = links_file.readlines()

for url in links:
    htmltext = urllib2.urlopen(url).read()
    soup = BeautifulSoup(htmltext)
    # Each post title is an <h3 class="post-title entry-title"> wrapping the post's <a> tag
    relative_tags_to_desired_links = soup.find_all("h3", "post-title entry-title")
    for tag in relative_tags_to_desired_links:
        # Step past the whitespace text node into the nested <a> element
        desired_element = tag.next_element.next_element
        print desired_element.get("href")
'link.txt' contains a list of URLs, for example:
http://ellywonderland.blogspot.com/
Output:
http://ellywonderland.blogspot.com/2011/03/my-vintage-pre-wedding.html
http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html
http://ellywonderland.blogspot.com/2010/12/tissue-paper-flower-crepe-paper.html
http://ellywonderland.blogspot.com/2010/12/menguap-menurut-islam.html
http://ellywonderland.blogspot.com/2010/12/weddings-idea.html
http://ellywonderland.blogspot.com/2010/12/kawin.html
http://ellywonderland.blogspot.com/2010/11/vitamin-c-collagen.html
So, can you help me with how to read all the pages of a Blogspot blog?
Thank you.
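
For reference, a minimal sketch of one possible approach: keep following the blog's "Older Posts" pagination link until it no longer appears. The class name blog-pager-older-link is an assumption based on the default Blogger template and may differ on customised themes, and extract_post_links is a hypothetical helper name, not something from the original code.

import urllib2
from bs4 import BeautifulSoup

def extract_post_links(start_url):
    # Collect post URLs from every page of a Blogspot blog by following pagination.
    url = start_url
    post_links = []
    while url:
        htmltext = urllib2.urlopen(url).read()
        soup = BeautifulSoup(htmltext, "html.parser")
        # Same extraction as above: each post title <h3> wraps the post's <a> tag
        for tag in soup.find_all("h3", "post-title entry-title"):
            anchor = tag.find("a")
            if anchor is not None:
                post_links.append(anchor.get("href"))
        # "Older Posts" link; class name assumed from the default Blogger template
        older = soup.find("a", "blog-pager-older-link")
        url = older.get("href") if older else None
    return post_links

for link in extract_post_links("http://ellywonderland.blogspot.com/"):
    print link

Alternatively, Blogspot blogs generally expose an Atom feed at /feeds/posts/default, which lists the posts without having to scrape the paginated HTML pages at all.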