How to run a spider sequentially against a site that uses sessions in Scrapy

Time: 2017-07-13 15:01:38

Tags: python-2.7 session scrapy ajaxform

I want to scrape a page that first sends an AjaxFormPost, which opens a session, and then sends a _SearchResultGridPopulate to populate the controls I need to scrape; the response is JSON.

Here is a snippet of my code:

def parse_AjaxFormPost(self, response):
    self.logger.info("parse_AjaxFormPost")
    page = response.meta['page']
    header = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
        'Content-Length': '14',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'ASP.NET_SessionId=gq4dgcsl500y32xb1n2ciexq',
        # ... (remaining headers elided)
    }
    url = '<url>/Search/AjaxFormPost'
    cities = ['city1', 'city2', ...]
    for city in cities:
        formData = {
            'City': city
        }
        # One POST per city; each one opens its own server-side session.
        re = scrapy.FormRequest(
            url,
            formdata=formData,
            headers=header,
            dont_filter=True,
            callback=self.parse_GridPopulate
        )
        yield re

def parse_GridPopulate(self, response):
    self.logger.info("parse_GridPopulate")
    url = '<url>/Search//_SearchResultGridPopulate?Grid-page=2&Grid-size=10&Grid-CERT_KEYSIZE=128&Grid-CERT_SECRETKEYSIZE=2048&Grid-HTTPS_KEYSIZE=128&Grid-HTTPS_SECRETKEYSIZE=2048'
    header = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
        'Content-Length': '23',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'ASP.NET_SessionId=gq4dgcsl500y32xb1n2ciexq',
        # ... (remaining headers elided)
    }
    formData = {
        'page': '1',
        'size': '10'
    }
    # This request must reuse the session opened by the AjaxFormPost above.
    re = scrapy.FormRequest(
        url,
        formdata=formData,
        headers=header,
        dont_filter=True,
        callback=self.parse
    )
    yield re


def parse(self, response):
    self.logger.info("parse_permit")
    # Requires "import json" at module level.
    data_json = json.loads(response.body)
    for row in data_json["data"]:
        self.logger.info(row)
        item = RedmondPermitItem()
        item["item1"] = row["item1"]
        item["item2"] = row["item2"]
        yield item
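For completeness, here is a minimal sketch of the item class these snippets assume, using the field names from the parse method above (the real definition may differ):

import scrapy

class RedmondPermitItem(scrapy.Item):
    # Field names taken from the parse method above.
    item1 = scrapy.Field()
    item2 = scrapy.Field()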

The problem is that Scrapy sends the requests concurrently. Each request yielded in parse_AjaxFormPost opens a session, so by the time parse_GridPopulate runs I get the session of the last request performed in parse_AjaxFormPost. So if I have 10 cities, I get the information of the last city 10 times.
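One Scrapy feature worth noting here is the documented cookiejar meta key, which keeps a separate cookie session per request chain. Below is a minimal sketch of that mechanism; the spider name, URLs, and city values are illustrative placeholders based on the snippets above, and every follow-up request must forward the jar explicitly:

import scrapy

class CityCookieJarSpider(scrapy.Spider):
    name = 'city_cookiejar'  # hypothetical name
    start_urls = ['<url>/Search']

    def parse(self, response):
        cities = ['city1', 'city2']
        for i, city in enumerate(cities):
            yield scrapy.FormRequest(
                '<url>/Search/AjaxFormPost',
                formdata={'City': city},
                dont_filter=True,
                callback=self.parse_grid,
                # Each numbered jar keeps its own ASP.NET_SessionId, so
                # concurrent requests no longer overwrite each other's
                # session cookies.
                meta={'cookiejar': i},
            )

    def parse_grid(self, response):
        # The jar is not sticky: forward it on every follow-up request.
        yield scrapy.FormRequest(
            '<url>/Search/_SearchResultGridPopulate',
            formdata={'page': '1', 'size': '10'},
            dont_filter=True,
            callback=self.parse_result,
            meta={'cookiejar': response.meta['cookiejar']},
        )

    def parse_result(self, response):
        pass  # parse the JSON grid here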

In the settings I changed the configuration:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1

It didn't work. Alternatively, I could run the spider for only one city at a time, like this:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    cities = ['city1', 'city2', ...]
    for city in cities:
        # Wait for each crawl to finish before starting the next one.
        yield runner.crawl(MySpider1, city=city)
    reactor.stop()  # stop the reactor after the last city

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
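For runner.crawl(MySpider1, city=city) to have any effect, the spider must accept the argument; Scrapy forwards crawl() keyword arguments to the spider's constructor. A minimal sketch of that part, with a hypothetical attribute name:

import scrapy

class MySpider1(scrapy.Spider):
    name = 'myspider1'  # hypothetical name

    def __init__(self, city=None, *args, **kwargs):
        # runner.crawl(MySpider1, city=city) passes the keyword here.
        super(MySpider1, self).__init__(*args, **kwargs)
        self.city = city  # use self.city when building the form request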

Maybe this is the only solution, but I'm not sure; it would mean creating a separate process like this for every site that behaves this way.

Any suggestion on how to solve this, ideally through configuration settings, would be appreciated.

Thanks in advance.

UPDATE 1: I changed the headers, because they are very important for sites that use sessions.

1 Answer:

Answer 0 (score: 1):

This is a matter of understanding how concurrency works here: you can't make Scrapy's parallel requests run sequentially globally, but you can sequence them between callbacks, yielding the next request only once the previous response has arrived. I would suggest something like this:

def parse_AjaxFormPost(self, response):
    ...
    cities = ['city1', 'city2', ...]
    formData = {
        'City': cities[0]  # only the first city is requested here
    }
    re = scrapy.FormRequest(
        url,
        formdata=formData,
        headers=header,
        dont_filter=True,
        callback=self.parse_remaining_cities,
        meta={'remaining_cities': cities[1:]},  # check the meta argument
    )
    yield re

def parse_remaining_cities(self, response):
    remaining_cities = response.meta['remaining_cities']
    if not remaining_cities:  # all cities done, stop the chain
        return
    current_city = remaining_cities[0]
    ...
    yield scrapy.Request(
        ...,
        meta={'remaining_cities': remaining_cities[1:]},
        callback=self.parse_remaining_cities)

This way you execute one request at a time, moving from city to city, so each city's session is created and consumed before the next one starts.
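To make the chaining concrete, here is a hedged sketch of how the two callbacks could fit together for this site; the spider name is hypothetical, and the URL, form fields, and city list are placeholders taken from the question, not verified values:

import scrapy

class CityChainSpider(scrapy.Spider):
    name = 'city_chain'  # hypothetical name
    start_urls = ['<url>/Search']

    def parse(self, response):
        cities = ['city1', 'city2']
        # Post only the first city; the rest travel along in meta.
        yield scrapy.FormRequest(
            '<url>/Search/AjaxFormPost',
            formdata={'City': cities[0]},
            dont_filter=True,
            callback=self.parse_remaining_cities,
            meta={'remaining_cities': cities[1:]},
        )

    def parse_remaining_cities(self, response):
        remaining = response.meta['remaining_cities']
        # Scrape the current city's grid here, then move on to the next
        # city only after this response has arrived, so its session can
        # no longer be clobbered by a concurrent request.
        if remaining:
            yield scrapy.FormRequest(
                '<url>/Search/AjaxFormPost',
                formdata={'City': remaining[0]},
                dont_filter=True,
                callback=self.parse_remaining_cities,
                meta={'remaining_cities': remaining[1:]},
            )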