Hey, I just started using Scrapy. I'm scraping basic product information from a popupstore website. The site uses AJAX requests to fetch all the data for a single product in JSON format. Here is my code:
```python
def parse_item(self, response):
    self.n += 1
    print("inside parse_item => ", self.n)
    popupitem = PopupItem()
    popupitem["url"] = response.url
    item_desc_api = self.get_item_desc_api(response)
    print("url to call =>", item_desc_api)
    # calling api url to get items description
    yield scrapy.Request(item_desc_api, callback=self.parse_item_from_api,
                         meta={"popupitem": popupitem})

def parse_item_from_api(self, response):
    self.m += 1
    print("inside parse_item_from_api =>", self.m)
    popupitem = response.meta["popupitem"]
    jsonresponse = json.loads(response.body_as_unicode())
    yield popupitem
```
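For context, what I eventually want `parse_item_from_api` to do is copy the JSON fields into the item; something along these lines, where the field names ("name", "price") are only placeholders and not real fields of my PopupItem yet:

```python
# Placeholder sketch -- "name" and "price" are made-up field names,
# just to show how the JSON response would be copied into the item.
popupitem = response.meta["popupitem"]
jsonresponse = json.loads(response.body_as_unicode())
popupitem["name"] = jsonresponse.get("name")
popupitem["price"] = jsonresponse.get("price")
yield popupitem
```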
I used two variables, n and m, to show how many times parse_item is called (n) and how many times parse_item_from_api is called (m).
Problem
When I run this code, it shows n -> 116 but m -> only 37. The program exits before all of the yielded requests have been processed, and only 37 items are stored in the output.json file. How can I make sure that all the yielded requests are processed before the program exits?
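I also notice `dupefilter/filtered: 154` in the stats below, so maybe many of the API URLs are being dropped as duplicates rather than lost at shutdown. A minimal sketch of what I could try if that is the cause (the same request as in parse_item, just exempted from the duplicate filter):

```python
# dont_filter=True tells Scrapy's scheduler not to drop this request even if
# the same URL was already seen (e.g. several products sharing one API endpoint).
yield scrapy.Request(item_desc_api,
                     callback=self.parse_item_from_api,
                     meta={"popupitem": popupitem},
                     dont_filter=True)
```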
Scrapy Logs
```
2017-06-13 13:37:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-13 13:37:40 [scrapy.extensions.feedexport] INFO: Stored json feed (37 items) in: out.json
2017-06-13 13:37:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 93446,
 'downloader/request_count': 194,
 'downloader/request_method_count/GET': 194,
 'downloader/response_bytes': 1808706,
 'downloader/response_count': 194,
 'downloader/response_status_count/200': 193,
 'downloader/response_status_count/301': 1,
 'dupefilter/filtered': 154,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 13, 8, 37, 40, 576449),
 'item_scraped_count': 37,
 'log_count/DEBUG': 233,
 'log_count/INFO': 8,
 'request_depth_max': 3,
 'response_received_count': 193,
 'scheduler/dequeued': 193,
 'scheduler/dequeued/memory': 193,
 'scheduler/enqueued': 193,
 'scheduler/enqueued/memory': 193,
 'start_time': datetime.datetime(2017, 6, 13, 8, 37, 17, 124336)}
2017-06-13 13:37:40 [scrapy.core.engine] INFO: Spider closed (finished)
```
Answer 0 (score: 0)
Create a list of all the requests you want to make, then chain through it one request at a time:
```python
from scrapy import Request  # at module level

# inside your Spider class:
def start_requests(self):
    # list of all the requests you want to make
    all_requests = ['https://website.com/1', 'https://website.com/2', 'https://website.com/3']
    link = all_requests.pop()  # take out the first request to make
    # make the first request; the remaining links and collected data travel in meta
    yield Request(url=link, callback=self.parse_1,
                  meta={'remaining_links': all_requests, 'data': []})

def parse_1(self, response):
    data = response.meta['data']
    # ... GRAB YOUR DATA FROM RESPONSE and append it to data ...
    remaining_links = response.meta['remaining_links']
    # if there are more requests to make, chain the next one
    if len(remaining_links) > 0:
        link = remaining_links.pop()  # take out the next request to make
        yield Request(url=link, callback=self.parse_1,
                      meta={'remaining_links': remaining_links, 'data': data})
    else:
        # no links left: yield the accumulated data as one item
        yield {'data': data}
```
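One variation worth noting: instead of accumulating everything in `data` and yielding it once at the very end, each response can yield its own item and then chain the next request; the spider stays open as long as requests keep being scheduled. A rough sketch of that variant (field names are placeholders):

```python
def parse_1(self, response):
    remaining_links = response.meta['remaining_links']

    # Yield one item per response instead of carrying a growing list in meta.
    yield {
        'url': response.url,
        # ... whatever else you grab from the response ...
    }

    # Chain the next request, if any links are left.
    if remaining_links:
        yield Request(url=remaining_links.pop(), callback=self.parse_1,
                      meta={'remaining_links': remaining_links})
```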