I am trying to scrape image URLs from multiple pages. That part works fine: all of the URLs are printed in my command prompt. But when I run the spider, only 40-70 images are actually downloaded, while roughly 4,000 images should be scraped.
import scrapy
from stscrape.items import ImageItem
import logging
import json
from PIL import Image


class StSpider(scrapy.Spider):
    name = "lst"

    def start_requests(self):
        urls = [
            'https://www.example.com/api/?subcategory=lts&page=1'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        logging.warning('in page')
        jsonresponse = json.loads(response.body_as_unicode())
        pages = jsonresponse['data']['pagination']['total_pages']
        for page in range(pages):
            yield scrapy.Request(url='https://www.example.com/api/?subcategory=lts&page=' + str(page), callback=self.page)

    def page(self, response):
        pageresponse = json.loads(response.body_as_unicode())
        items = pageresponse['data']['segments'][0]['segment_items']
        image_url = []
        for item in items:
            image_url.append(item['product_card']['image_url'])
        yield ImageItem(image_urls=image_url)

I think the problem is related to the image download process; somehow it prevents the remaining images from being downloaded.
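For reference, the item and the pipeline are wired up roughly like this. This is a simplified sketch: it assumes the built-in scrapy.pipelines.images.ImagesPipeline is enabled, and the store path is just a placeholder.

# items.py - fields the built-in ImagesPipeline works with
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # list of URLs to download
    images = scrapy.Field()      # filled in by the pipeline after download

# settings.py - enable the image pipeline and point it at a storage folder
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/images'  # placeholder path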
EDIT
This is the Scrapy output:
2018-01-23 21:43:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 91540,
'downloader/request_count': 160,
'downloader/request_method_count/GET': 160,
'downloader/response_bytes': 1313499,
'downloader/response_count': 160,
'downloader/response_status_count/200': 160,
'dupefilter/filtered': 1,
'file_count': 55,
'file_status_count/downloaded': 55,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 1, 23, 20, 43, 20, 693000),
'item_scraped_count': 99,
'log_count/DEBUG': 316,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 160,
'scheduler/dequeued': 100,
'scheduler/dequeued/memory': 100,
'scheduler/enqueued': 100,
'scheduler/enqueued/memory': 100,
'start_time': datetime.datetime(2018, 1, 23, 20, 43, 5, 45000)}
2018-01-23 21:43:20 [scrapy.core.engine] INFO: Spider closed (finished)
I think the problem is that once one page URL has been processed and its images placed in image_urls, the item is handed off with yield, and that somehow blocks the rest of the images from being processed.
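To make the comparison concrete, the alternative would be yielding one item per product instead of one item holding all URLs for the page. A rough sketch of that variant, assuming the same JSON layout as above:

    def page(self, response):
        pageresponse = json.loads(response.body_as_unicode())
        # yield one ImageItem per product instead of one list per page
        for item in pageresponse['data']['segments'][0]['segment_items']:
            yield ImageItem(image_urls=[item['product_card']['image_url']])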