How can I crawl the same URL with scrapy by posting different data?

Asked: 2015-10-15 03:56:19

Tags: python post scrapy

I want to crawl a website by posting different page numbers, but I only get the data from the first page and then the spider finishes. I suspect the spider keeps requesting the same URL and the requests are being filtered out by scrapy.
Here is my code:

import json

from scrapy import Spider
from scrapy.http import FormRequest

# HEADER and COOKIES are project-level constants defined elsewhere.

class ZhejiangCrawl(Spider):
    name = 'ZhejiangCrawl'
    root_url= 'http://www.zjsfgkw.cn/Execute/CreditCompany'
    start_page = 1
    current_page = start_page
    end_page = 24974
    post_data = {'PageNo': str(current_page), 'PageSize': '5', 'ReallyName': '',
                 'CredentialsNumber': '', 'AH': '', 'ZXFY': '', 'StartLARQ': '',
                 'EndLARQ': ''}
    headers = HEADER
    cookies = COOKIES

    def start_requests(self):
        return [FormRequest(self.root_url, headers=self.headers, cookies=self.cookies, formdata=self.post_data, dont_filter=True,
                        callback=self.parse)]

    def parse(self, response):
        if self.current_page < self.end_page:
            self.current_page += 1
            self.post_data['PageNo'] = str(self.current_page)
            yield [FormRequest(self.root_url, headers=self.headers, cookies=self.cookies, dont_filter=True,
                           formdata=self.post_data, callback=self.parse)]

        jsonstr = json.loads(response.body)
        for item_dict in jsonstr['informationmodels']:
            item = ZhejiangcrawlItem()
            item['name'] = item_dict['ReallyName']
            item['cardNum'] = item_dict['CredentialsNumber']
            item['performance'] = item_dict['ZXJE']
            item['unperformance'] = item_dict['WZXJE']
            item['gistUnit'] = item_dict['ZXFY']
            item['address'] = item_dict['Address']
            item['gistId'] = item_dict['ZXYJ']
            item['caseCode'] = item_dict['AH']
            item['regDate'] = item_dict['LARQ']
            item['exposureDate'] = item_dict['BGRQ']
            item['gistReason'] = item_dict['ZXAY']
            yield item

How can I solve this?

1 Answer:

Answer 0: (score: 0)

If you think the requests are being filtered by the DupeFilter, add dont_filter=True to your FormRequests.

Another thing to note: there is no reason to wrap what you yield or return in a list; yield the FormRequest itself.
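
As a minimal sketch (assuming the rest of the spider, including HEADER, COOKIES and ZhejiangcrawlItem, stays as in the question), the parse method would then look like this:

    def parse(self, response):
        # Schedule the next page: yield the FormRequest itself, not a list containing it.
        if self.current_page < self.end_page:
            self.current_page += 1
            self.post_data['PageNo'] = str(self.current_page)
            yield FormRequest(self.root_url, headers=self.headers, cookies=self.cookies,
                              formdata=self.post_data, dont_filter=True,
                              callback=self.parse)

        # Parse the JSON body and build items exactly as in the question.
        jsonstr = json.loads(response.body)
        for item_dict in jsonstr['informationmodels']:
            item = ZhejiangcrawlItem()
            item['name'] = item_dict['ReallyName']
            # ... remaining fields as in the question ...
            yield item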