Scrapy - wait until all yielded requests are completed

Time: 2017-06-13 07:18:07

Tags: python python-3.x web-scraping scrapy

Hey, I have just started using Scrapy. I am scraping basic item information from the website popupstore. This website uses AJAX requests to fetch all the data related to a single product in JSON format. Here is my code:

    def parse_item(self, response):
        self.n += 1
        print("inside parse_item => ", self.n)

        popupitem = PopupItem()
        popupitem["url"] = response.url
        item_desc_api = self.get_item_desc_api(response)
        print("url to call =>", item_desc_api)
        # calling api url to get items description
        yield scrapy.Request(item_desc_api, callback=self.parse_item_from_api,
                             meta={"popupitem": popupitem})

    def parse_item_from_api(self, response):
        self.m += 1
        print("inside parse_item_from_api =>", self.m)
        popupitem = response.meta["popupitem"]
        jsonresponse = json.loads(response.body_as_unicode())
        yield popupitem

I used two variables, n and m, to show how many times parse_item (n) and parse_item_from_api (m) are called.

Problem

When I run this code, it shows n -> 116 but m -> only 37. The program exits before all the yielded requests have been processed, and only 37 items are stored in the output JSON file. How can I ensure that all the yielded requests are processed before the program exits?

Scrapy Logs

2017-06-13 13:37:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-13 13:37:40 [scrapy.extensions.feedexport] INFO: Stored json feed (37 items) in: out.json
2017-06-13 13:37:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 93446,
'downloader/request_count': 194,
'downloader/request_method_count/GET': 194,
'downloader/response_bytes': 1808706,
'downloader/response_count': 194,
'downloader/response_status_count/200': 193,
'downloader/response_status_count/301': 1,
'dupefilter/filtered': 154,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 6, 13, 8, 37, 40, 576449),
'item_scraped_count': 37,
'log_count/DEBUG': 233,
'log_count/INFO': 8,
'request_depth_max': 3,
'response_received_count': 193,
'scheduler/dequeued': 193,
'scheduler/dequeued/memory': 193,
'scheduler/enqueued': 193,
'scheduler/enqueued/memory': 193,
'start_time': datetime.datetime(2017, 6, 13, 8, 37, 17, 124336)}
2017-06-13 13:37:40 [scrapy.core.engine] INFO: Spider closed (finished)
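
One detail worth noting in these stats is dupefilter/filtered: 154, which suggests that most of the API requests yielded from parse_item are being dropped by Scrapy's duplicate-request filter (several product pages may resolve to the same API URL) rather than being left unprocessed at shutdown. A minimal sketch of the same request with dont_filter=True, assuming those repeated API calls really are wanted:

    def parse_item(self, response):
        popupitem = PopupItem()
        popupitem["url"] = response.url
        item_desc_api = self.get_item_desc_api(response)
        # dont_filter=True keeps the dupe filter from silently dropping
        # API URLs that have already been requested once
        yield scrapy.Request(item_desc_api,
                             callback=self.parse_item_from_api,
                             meta={"popupitem": popupitem},
                             dont_filter=True)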

1 Answer:

Answer 0: (score: 0)

Create a list of all the requests you want to make:

# list of all the requests to make; Request here is scrapy.Request
all_requests = ['https://website.com/1', 'https://website.com/2', 'https://website.com/3']

link = all_requests.pop()  # take one URL off the list

# make the first request, carrying the remaining links and the
# data collected so far in the request's meta
yield Request(url=link, callback=self.parse_1,
              meta={'remaining_links': all_requests, 'data': []})

def parse_1(self, response):

    data = list(response.meta['data'])  # data accumulated by earlier responses

    # ... GRAB YOUR DATA FROM RESPONSE and append it to data ...

    remaining_links = response.meta['remaining_links']

    # if there are more requests to make, chain the next one
    if len(remaining_links) > 0:
        link = remaining_links.pop()  # take the next URL off the list

        yield Request(url=link, callback=self.parse_1,
                      meta={'remaining_links': remaining_links, 'data': data})

    else:
        # no links left: yield everything collected as a single item (a dict,
        # since a spider cannot yield a bare list)
        yield {'data': data}
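
Note that the first yield Request(...) above has to live inside a spider callback such as start_requests() or parse(). The idea of this pattern is to chain the URLs one after another: each response schedules the next request and passes the accumulated data forward in meta, so the combined result is only yielded after the last link has been handled. The final result is wrapped in a dict because Scrapy expects spiders to yield dicts, Item objects, or Requests, not a bare list.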