我需要按顺序发送我的请求与Scrapy。
def n1(self, response) :
#self.input = [elem1,elem2,elem3,elem4,elem5, .... ,elem100000]
for (elem,) in self.input :
link = urljoin(path,elem)
yield Request(link)
我的问题是请求的顺序不正确。 我读过this question,但没有正确答案。
如何更改我的代码以按顺序发送请求?
更新1
我使用优先级并将代码更改为
def n1(self, response) :
#self.input = [elem1,elem2,elem3,elem4,elem5, .... ,elem100000]
self.prio = len(self.input)
for (elem,) in self.input :
self.prio -= 1
link = urljoin(path,elem)
yield Request(link, priority=self.prio)
我为此蜘蛛设置
custom_settings = {
'DOWNLOAD_DELAY' : 0,
'COOKIES_ENABLED' : True,
'CONCURRENT_REQUESTS' : 1 ,
'AUTOTHROTTLE_ENABLED' : False,
}
现在顺序已更改,但不是数组中元素的顺序
答案 0 :(得分:2)
使用return
语句代替yield
。
您甚至不需要触摸任何设置:
from scrapy.spiders import Spider, Request
class MySpider(Spider):
name = 'toscrape.com'
start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
urls = (
'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
)
def parse(self, response):
for url in self.urls:
return Request(url)
输出:
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-1.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-2.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-3.html> (referer: http://books.toscrape.com/catalogue/page-2.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-4.html> (referer: http://books.toscrape.com/catalogue/page-3.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-5.html> (referer: http://books.toscrape.com/catalogue/page-4.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-6.html> (referer: http://books.toscrape.com/catalogue/page-5.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-7.html> (referer: http://books.toscrape.com/catalogue/page-6.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-8.html> (referer: http://books.toscrape.com/catalogue/page-7.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-9.html> (referer: http://books.toscrape.com/catalogue/page-8.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-10.html> (referer: http://books.toscrape.com/catalogue/page-9.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-11.html> (referer: http://books.toscrape.com/catalogue/page-10.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-12.html> (referer: http://books.toscrape.com/catalogue/page-11.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-13.html> (referer: http://books.toscrape.com/catalogue/page-12.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: http://books.toscrape.com/catalogue/page-13.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: http://books.toscrape.com/catalogue/page-14.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: http://books.toscrape.com/catalogue/page-15.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: http://books.toscrape.com/catalogue/page-16.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: http://books.toscrape.com/catalogue/page-17.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: http://books.toscrape.com/catalogue/page-18.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: http://books.toscrape.com/catalogue/page-19.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: http://books.toscrape.com/catalogue/page-20.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: http://books.toscrape.com/catalogue/page-21.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: http://books.toscrape.com/catalogue/page-22.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: http://books.toscrape.com/catalogue/page-23.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: http://books.toscrape.com/catalogue/page-24.html)
通过yield
语句,引擎从生成器获取所有响应,并以任意顺序执行它们(我怀疑它们可能以某种形式存储以删除重复项)。
答案 1 :(得分:0)
我认为并发请求正在此处播放。您可以尝试设置
custom_settings = {
'CONCURRENT_REQUESTS': 1
}
默认设置为8。这将解释为什么当您有其他工人上班时为什么不优先考虑优先级。
答案 2 :(得分:0)
只有在收到前一个请求后,才能发送下一个请求:
class MainSpider(Spider):
urls = [
'https://www.url1...',
'https://www.url2...',
'https://www.url3...',
]
def start_requests(self):
yield Request(
url=self.urls[0],
callback=self.parse,
meta={'next_index': 1},
)
def parse(self, response):
next_index = response.meta['next_index']
# do something with response...
# Process next url
if next_index < len(self.urls):
yield Request(
url=self.urls[next_index],
callback=self.parse,
meta={'next_index': next_index+1},
)