I want to crawl a website by POSTing different page numbers, but I only get the data from the first page and then the spider finishes. I suspect the requests all go to the same URL and are therefore filtered out by Scrapy as duplicates.
Here is my code:
class ZhejiangCrawl(Spider):
    name = 'ZhejiangCrawl'
    root_url = 'http://www.zjsfgkw.cn/Execute/CreditCompany'
    start_page = 1
    current_page = start_page
    end_page = 24974
    post_data = {'PageNo': str(current_page), 'PageSize': '5', 'ReallyName': '', 'CredentialsNumber': '',
                 'AH': '', 'ZXFY': '', 'StartLARQ': '', 'EndLARQ': ''}
    headers = HEADER
    cookies = COOKIES

    def start_requests(self):
        return [FormRequest(self.root_url, headers=self.headers, cookies=self.cookies,
                            formdata=self.post_data, dont_filter=True, callback=self.parse)]

    def parse(self, response):
        if self.current_page < self.end_page:
            self.current_page += 1
            self.post_data['PageNo'] = str(self.current_page)
            yield [FormRequest(self.root_url, headers=self.headers, cookies=self.cookies, dont_filter=True,
                               formdata=self.post_data, callback=self.parse)]

        jsonstr = json.loads(response.body)
        for item_dict in jsonstr['informationmodels']:
            item = ZhejiangcrawlItem()
            item['name'] = item_dict['ReallyName']
            item['cardNum'] = item_dict['CredentialsNumber']
            item['performance'] = item_dict['ZXJE']
            item['unperformance'] = item_dict['WZXJE']
            item['gistUnit'] = item_dict['ZXFY']
            item['address'] = item_dict['Address']
            item['gistId'] = item_dict['ZXYJ']
            item['caseCode'] = item_dict['AH']
            item['regDate'] = item_dict['LARQ']
            item['exposureDate'] = item_dict['BGRQ']
            item['gistReason'] = item_dict['ZXAY']
            yield item
How can I fix this?
Answer 0 (score: 0)
If you believe your requests are being filtered by the DupeFilter, add dont_filter=True to your FormRequests.
Also note that there is no reason to wrap what you yield/return in a list.
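For reference, here is a minimal sketch of how the paging loop could look with those two points applied: the FormRequest is yielded directly (not wrapped in a list), and dont_filter=True is kept on every request since they all post to the same URL. The page_request helper and the meta['page_no'] key are illustrative names of mine; HEADER, COOKIES, and ZhejiangcrawlItem are assumed to come from the asker's own project and are not defined here.

# Sketch only: assumes HEADER, COOKIES, and ZhejiangcrawlItem exist elsewhere in the project.
import json

from scrapy import FormRequest, Spider


class ZhejiangCrawl(Spider):
    name = 'ZhejiangCrawl'
    root_url = 'http://www.zjsfgkw.cn/Execute/CreditCompany'
    start_page = 1
    end_page = 24974

    def start_requests(self):
        # Kick off with the first page.
        yield self.page_request(self.start_page)

    def page_request(self, page_no):
        # Build one POST request per page; dont_filter=True bypasses the DupeFilter,
        # because every page is fetched from the same URL.
        post_data = {'PageNo': str(page_no), 'PageSize': '5', 'ReallyName': '',
                     'CredentialsNumber': '', 'AH': '', 'ZXFY': '',
                     'StartLARQ': '', 'EndLARQ': ''}
        return FormRequest(self.root_url, headers=HEADER, cookies=COOKIES,
                           formdata=post_data, dont_filter=True,
                           callback=self.parse, meta={'page_no': page_no})

    def parse(self, response):
        page_no = response.meta['page_no']
        if page_no < self.end_page:
            # Yield the request itself, not a list containing it.
            yield self.page_request(page_no + 1)

        jsonstr = json.loads(response.body)
        for item_dict in jsonstr['informationmodels']:
            item = ZhejiangcrawlItem()
            item['name'] = item_dict['ReallyName']
            item['cardNum'] = item_dict['CredentialsNumber']
            item['performance'] = item_dict['ZXJE']
            item['unperformance'] = item_dict['WZXJE']
            item['gistUnit'] = item_dict['ZXFY']
            item['address'] = item_dict['Address']
            item['gistId'] = item_dict['ZXYJ']
            item['caseCode'] = item_dict['AH']
            item['regDate'] = item_dict['LARQ']
            item['exposureDate'] = item_dict['BGRQ']
            item['gistReason'] = item_dict['ZXAY']
            yield item

Tracking the page number in request meta (rather than mutating self.current_page and a shared post_data dict) also avoids relying on spider-level state while requests are in flight.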