I am trying to scrape http://targetstudy.com/school/schools-in-chhattisgarh.html
I am using lxml.html and urllib2.
I want to follow all the pages by somehow clicking the next-page link and downloading its source, stopping at the last page. The href of the next-page link is ['?recNo=25'].
Can somebody suggest how to do this? Thanks in advance.
Here is my code:
import urllib2
import lxml.html
import itertools
url = "http://targetstudy.com/school/schools-in-chhattisgarh.html"
req = urllib2.Request(url, headers={ 'User-Agent': 'Mozilla/5.0' })
stuff = urllib2.urlopen(req).read().encode('ascii', 'ignore')
tree = lxml.html.fromstring(stuff)
print stuff
links = tree.xpath("(//ul[@class='pagination']/li/a)[last()]/@href")
for link in links:
    req = urllib2.Request(url, headers={ 'User-Agent': 'Mozilla/5.0' })
    stuff = urllib2.urlopen(req).read().encode('ascii', 'ignore')
    tree = lxml.html.fromstring(stuff)
    print stuff
    links = tree.xpath("(//ul[@class='pagination']/li/a)[last()]/@href")
But all it does is go to the second page and no further.
Please help.
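Note that the next-page href is relative ('?recNo=25'), so it cannot be opened directly; it first has to be resolved against the URL of the page it came from. A minimal sketch using the standard library's urljoin (the import path shown is Python 3's, with a Python 2 fallback):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = "http://targetstudy.com/school/schools-in-chhattisgarh.html"
# a query-only relative href keeps the path and replaces the query string
next_url = urljoin(base, "?recNo=25")
print(next_url)
# http://targetstudy.com/school/schools-in-chhattisgarh.html?recNo=25
```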
Answer 0 (score: 1)
I suspect all of your problems come from overwriting your list of links at the end of the loop. Assuming the rest of the code works, this might be a better solution:
import urllib2
import lxml.html
import itertools
url = "http://targetstudy.com/school/schools-in-chhattisgarh.html"
req = urllib2.Request(url, headers={ 'User-Agent': 'Mozilla/5.0' })
stuff = urllib2.urlopen(req).read().encode('ascii', 'ignore')
tree = lxml.html.fromstring(stuff)
print stuff
links = [url]
visited = []
# needed to resolve relative hrefs (e.g. '?recNo=25') against the current page URL
import urlparse

while len(links) > 0:
    # take a link out of the list and mark it as visited
    link = links.pop()
    visited.append(link)
    # open the link and read the contents
    req = urllib2.Request(link, headers={ 'User-Agent': 'Mozilla/5.0' })
    stuff = urllib2.urlopen(req).read().encode('ascii', 'ignore')
    tree = lxml.html.fromstring(stuff)
    print stuff
    # for every next-page link in the page
    for new_link in tree.xpath("(//ul[@class='pagination']/li/a)[last()]/@href"):
        # make the relative href absolute so it can actually be opened
        new_link = urlparse.urljoin(link, new_link)
        # if the link has not been visited yet and is not in the list to visit next
        if new_link not in links and new_link not in visited:
            # add the new link to the list of links to visit
            links.append(new_link)