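I want to scrape a web page. I first send an AjaxFormPost request that opens a session, then send a _SearchResultGridPopulate request to populate the control I need to scrape. The response is JSON.
Here is a snippet of my code: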
def parse_AjaxFormPost(self, response):
    self.logger.info("parse_AjaxFormPost")
    page = response.meta['page']
    header = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
        'Content-Length': '14',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'ASP.NET_SessionId=gq4dgcsl500y32xb1n2ciexq',
        # ... (remaining headers omitted)
    }
    url = '<url>/Search/AjaxFormPost'
    cities = ['city1', 'city2', ...]
    for city in cities:
        formData = {
            'City': city
        }
        request = scrapy.FormRequest(
            url,
            formdata=formData,
            headers=header,
            dont_filter=True,
            callback=self.parse_GridPopulate
        )
        yield request
def parse_GridPopulate(self, response):
    self.logger.info("parse_LookupPermitTypeDetails")
    url = '<url>/Search//_SearchResultGridPopulate?Grid-page=2&Grid-size=10&Grid-CERT_KEYSIZE=128&Grid-CERT_SECRETKEYSIZE=2048&Grid-HTTPS_KEYSIZE=128&Grid-HTTPS_SECRETKEYSIZE=2048'
    header = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
        'Content-Length': '23',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'ASP.NET_SessionId=gq4dgcsl500y32xb1n2ciexq',
        # ... (remaining headers omitted)
    }
    formData = {
        'page': '1',
        'size': '10'
    }
    request = scrapy.FormRequest(
        url,
        formdata=formData,
        headers=header,
        dont_filter=True,
        callback=self.parse
    )
    yield request
def parse(self, response):
    self.logger.info("parse_permit")
    data_json = json.loads(response.body)
    for row in data_json["data"]:
        self.logger.info(row)
        item = RedmondPermitItem()
        item["item1"] = row["item1"]
        item["item2"] = row["item2"]
        yield item
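The problem is that Scrapy sends the requests concurrently, and every request issued in parse_AjaxFormPost opens a session. By the time parse_LookupPermitTypeDetails runs, I get the session of the last request made in parse_AjaxFormPost. So if I have 10 cities, I end up with the last city's information 10 times.
In the settings, I changed the configuration: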
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
It doesn't work. On the other hand, I thought about running the spider for only one city at a time, like this:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your first spider definition
    ...
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    cities = ['city1', 'city2', ...]
    for city in cities:
        # Each crawl finishes before the next one starts
        yield runner.crawl(MySpider, city=city)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
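Maybe this is the only solution, but I'm not sure. I would like to have a general procedure for every site with this characteristic.
Any suggestions on how to solve this are welcome. Is it possible to achieve it through configuration settings alone?
Thanks in advance.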
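UPDATE 1: I changed the title, because this matters for any website that uses sessions.
Answer 0 (score: 1)
This is a matter of understanding how concurrency works here: it is not parallelism, so you can still work sequentially, but you have to do it between callbacks. I would suggest something like this: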
def parse_AjaxFormPost(self, response):
    ...
    cities = ['city1', 'city2', ...]
    formData = {
        'City': cities[0]
    }
    request = scrapy.FormRequest(
        url,
        formdata=formData,
        headers=header,
        dont_filter=True,
        callback=self.parse_remaining_cities,
        meta={'remaining_cities': cities[1:]},  # check the meta argument
    )
    yield request

def parse_remaining_cities(self, response):
    remaining_cities = response.meta['remaining_cities']
    if not remaining_cities:
        return  # no cities left, so the chain ends here
    current_city = remaining_cities[0]
    ...
    yield Request(
        ...,
        meta={'remaining_cities': remaining_cities[1:]},
        callback=self.parse_remaining_cities)
This way you perform one request at a time, going through the cities one after another.