How do I fetch all pages (from pagination) with urllib2.urlopen?

Asked: 2016-06-22 11:51:06

Tags: python parsing pagination

I want to fetch and parse every page, stopping once the response no longer contains a "load more" URL (the `link` variable). How can I modify the following code so that it keeps fetching pages until that condition is reached?

import urllib2, re

Fromurl = "https://somesite.com/n/series/123456/"

req = urllib2.Request(Fromurl)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36')
response = urllib2.urlopen(req)
link = response.read()
print "value of link here:"
print link
response.close()

# try to parse the response here

#################### here we get the next page url ###################
# NOTE: this pattern has no capturing group, so re.findall returns a list of
# the literal string 'loadmore', and matchNextPageUrl[0][0] is just the
# letter 'l'. A capturing group around the page id is needed, e.g.
# r'loadmore(\d+)' -- the exact pattern depends on the site's markup.
matchNextPageUrl = re.findall('loadmore', link, re.UNICODE)
print "value of matchNextPageUrl[0][0]"

print "https://somesite.com/n/series/nexpage" + matchNextPageUrl[0][0] + ".sort-number:DESC.pageNumber-1"
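The single fetch above can be turned into a loop that follows "load more" links until none is found. A minimal sketch of that idea, using the Python 3 equivalent `urllib.request` (the question's `urllib2` is Python 2 only); the `NEXT_PAGE_RE` pattern is a placeholder that must be adapted to the site's actual markup, the next-page URL template is taken from the question's final print line, and the fetch step is injectable so the loop can be exercised without network access:

```python
import re
import urllib.request

HEADERS = {'User-Agent': 'Mozilla/5.0'}

# Hypothetical pattern: capture the page id that follows "loadmore" in the
# HTML. Adjust this to whatever the site actually emits.
NEXT_PAGE_RE = re.compile(r'loadmore["\']?\s*[:=]\s*["\']([^"\']+)')


def fetch(url):
    """Fetch one page body (Python 3 counterpart of the urllib2 snippet)."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode('utf-8', errors='replace')


def fetch_all_pages(start_url, fetch=fetch, max_pages=100):
    """Follow 'load more' links until none appears; return all page bodies."""
    pages = []
    url = start_url
    while url and len(pages) < max_pages:
        body = fetch(url)
        pages.append(body)
        match = NEXT_PAGE_RE.search(body)
        if match is None:
            break  # no "load more" URL in the response: we are done
        # URL template taken from the question; the real site may differ.
        url = ("https://somesite.com/n/series/nexpage"
               + match.group(1) + ".sort-number:DESC.pageNumber-1")
    return pages
```

Because `fetch` is a parameter, the loop can be tested with a dictionary of canned responses instead of live HTTP, and `max_pages` guards against a site that always returns another "load more" link.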

0 Answers:

No answers yet