Question

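I want to scrape some information from a page. It turns out the information is not available on that page, but the page does contain a link to another page that has the data I need. So I want my scraper to follow that link and pick up the data there. I came up with an approach, but I cannot get it to work.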

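What I am trying to get the scraper to do is:
- make a new request to the subsequent link
- scrape the data
- return the data
- with all the data in hand, continue to the next step of the script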

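The problem is that the earlier code (the part that issues the new request) does not wait for the second request, and it ends up raising an error because that data is missing.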

Here is the code (simplified to reflect what I am asking):

def parse_hearing_page(self, response):
    item = response.meta['item']
    selector = XPATH_PATTERN[self.page]
    for p in response.xpath(selector):
        date = ' '.join(p.xpath('strong/text()').extract())
        text = ' '.join(p.xpath('.//text()').extract())
        scraped_data = {
            'date': parse_date(date),
            'author': None,
            'content': text,
            'url': response.url,
        }

        if date not in text:
            subsequent_page = p.xpath('../ul/li/a/@href').get()
            # parse_author reads response.meta['item'], so pass the partial item under that key
            yield Request(urljoin(response.url, subsequent_page), meta={'item': scraped_data}, callback=self.parse_author)
        else:
            author = text.split('Authors')[1]
            scraped_data['author'] = author
        if scraped_data['author'] is None:
            raise ValueError
        scraped_data['id'] = self.get_id_from_author(scraped_data['author'], scraped_data['url'])
        yield scraped_data   

def parse_author(self, response):
    selector = ('//div//article[@id="authors"]/text()')
    item = response.meta['item']
    author = response.xpath(selector).get()
    item['author'] = author
    yield item

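On pages where the data is not available, the code raises an error, because it does not wait for the second request that scrapes the data for scraped_data['author'].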
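To make the behavior concrete, here is a toy model of callback scheduling written with the standard library only. It is not Scrapy's actual engine: `FakeRequest`, `run`, the URLs and the author name are all hypothetical stand-ins. It only illustrates that yielding a request queues it for later, and that the code after the `yield` runs before the follow-up callback has been called.

```python
from collections import deque

events = []  # records the order in which things happen


class FakeRequest:
    """Hypothetical stand-in for a request: a URL, a callback and carried data."""

    def __init__(self, url, callback, meta):
        self.url = url
        self.callback = callback
        self.meta = meta


def parse_listing(url, meta):
    item = {'url': url, 'author': None}
    # Yielding only hands the request to the scheduler. The generator resumes
    # as soon as the scheduler asks for the next value, not when the
    # follow-up page has been processed.
    yield FakeRequest(url + '/authors', parse_author, {'item': item})
    events.append(('after_yield', item['author']))  # author is still None here


def parse_author(url, meta):
    item = meta['item']
    item['author'] = 'Jane Doe'  # made-up value standing in for scraped data
    events.append(('author_parsed', item['author']))
    yield item


def run(start_url, callback):
    """Tiny scheduler: drain each callback fully, queueing any requests it yields."""
    queue = deque([FakeRequest(start_url, callback, {})])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url, request.meta):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = run('http://example.com/hearing', parse_listing)
print(events)
```

In this model the `after_yield` event is recorded first, with the author still missing, which mirrors the check on scraped_data['author'] failing in the spider.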

I believe this may be standard behavior, but is there a way to prevent it? Or is there another way to solve this kind of problem?

Scrapy does not wait for the second request to scrape data from the subsequent link

0 answers: