How do I return an item if I don't know when the spider will finish?

Date: 2015-10-23 01:26:14

Tags: web-scraping scrapy scrapy-spider

My spider takes in a list of websites and crawls each one by yielding a request for it, passing an item along in the request's meta.

The spider then explores all the internal links of a single website and collects all the external links into the item. The problem is that I don't know when the spider has finished crawling all the internal links, so I cannot yield the item:

    class WebsiteSpider(scrapy.Spider):
        name = "web"

        def start_requests(self):
            filename = "websites.csv"
            requests = []
            try:
                with open(filename, 'r') as csv_file:
                    reader = csv.reader(csv_file)
                    header = next(reader)
                    for row in reader:
                        seed_url = row[1].strip()
                        item = Links(base_url=seed_url, on_list=[])
                        request = Request(seed_url, callback=self.parse_seed)
                        request.meta['item'] = item
                        requests.append(request)
                return requests
            except IOError:
                raise scrapy.exceptions.CloseSpider("A list of websites are needed")

        def parse_seed(self, response):
            item = response.meta['item']
            netloc = urlparse(item['base_url']).netloc
            external_le = LinkExtractor(deny_domains=netloc)
            external_links = external_le.extract_links(response)
            for external_link in external_links:
                item['on_list'].append(external_link)

            internal_le = LinkExtractor(allow_domains=netloc)
            internal_links = internal_le.extract_links(response)
            for internal_link in internal_links:
                request = Request(internal_link, callback=self.parse_seed)
                request.meta['item'] = item
                yield request

1 answer:

Answer 0 (score: 0)

The start_requests method needs to yield Request objects. You don't have to return a list of requests; just yield each request as soon as it is ready, because Scrapy requests are asynchronous.

The same goes for items: yield an item whenever you consider it ready. For your case I would suggest simply checking whether there are any internal_links left before yielding the item, or you can yield as many items as you want and then work out which one was the last (or which one has the most data):

    import csv
    from urllib.parse import urlparse  # Python 3; on Python 2 use `from urlparse import urlparse`

    import scrapy
    from scrapy import Request
    from scrapy.exceptions import CloseSpider
    from scrapy.linkextractors import LinkExtractor


    class WebsiteSpider(scrapy.Spider):
        name = "web"

        def start_requests(self):
            filename = "websites.csv"
            try:
                with open(filename, 'r') as csv_file:
                    reader = csv.reader(csv_file)
                    header = next(reader)  # skip the header row
                    for row in reader:
                        seed_url = row[1].strip()
                        # `Links` is the Item class defined in the project's items.py (not shown)
                        item = Links(base_url=seed_url, on_list=[])
                        # yield each request as soon as it is built instead of
                        # collecting them in a list and returning it
                        yield Request(seed_url, callback=self.parse_seed,
                                      meta={'item': item})
            except IOError:
                raise CloseSpider("A list of websites is needed")

        def parse_seed(self, response):
            item = response.meta['item']
            netloc = urlparse(item['base_url']).netloc

            # collect every link pointing outside the seed's domain into the item
            external_le = LinkExtractor(deny_domains=netloc)
            external_links = external_le.extract_links(response)
            for external_link in external_links:
                item['on_list'].append(external_link)

            # follow links that stay inside the seed's domain
            internal_le = LinkExtractor(allow_domains=netloc)
            internal_links = internal_le.extract_links(response)
            if internal_links:
                for internal_link in internal_links:
                    request = Request(internal_link.url, callback=self.parse_seed)
                    request.meta['item'] = item
                    yield request
            else:
                # no internal links left on this page: consider the item ready
                yield item
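
Note that in the code above the item is shared by every request of the same seed, so each page without internal links will yield that same item again. If you want exactly one item per seed, one common pattern (not part of this answer, just a sketch) is to keep a shared counter of outstanding requests in meta and yield the item only when it drops to zero. The `counter` dict and the `on_error` errback below are hypothetical additions:

    # Sketch under the assumption that this parse_seed replaces the version above,
    # and that start_requests passes meta={'item': item, 'counter': {'pending': 1}}.
    def parse_seed(self, response):
        item = response.meta['item']
        counter = response.meta['counter']
        counter['pending'] -= 1              # this page has now been processed

        netloc = urlparse(item['base_url']).netloc
        for link in LinkExtractor(deny_domains=netloc).extract_links(response):
            item['on_list'].append(link.url)

        for link in LinkExtractor(allow_domains=netloc).extract_links(response):
            counter['pending'] += 1          # one more page of this seed is outstanding
            yield Request(link.url, callback=self.parse_seed, errback=self.on_error,
                          meta={'item': item, 'counter': counter})

        if counter['pending'] == 0:
            yield item                       # last outstanding page for this seed

    def on_error(self, failure):
        # failed requests must also decrement the counter,
        # otherwise the item is never yielded
        counter = failure.request.meta['counter']
        counter['pending'] -= 1
        if counter['pending'] == 0:
            yield failure.request.meta['item']

One caveat: requests silently dropped by Scrapy's duplicate filter never reach the callback or the errback, so the counter can get stuck above zero; you would either need `dont_filter=True` plus your own set of seen URLs, or have to accept that some items are never emitted.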

Another thing you could do is create an extension with a spider_closed method and do whatever you need there once you know the spider has finished.
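
A minimal sketch of such an extension, using Scrapy's standard signals API (the module path `myproject.extensions` is just a placeholder for your own project):

    from scrapy import signals


    class SpiderClosedExtension:

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            # run ext.spider_closed once the spider has no more requests to process
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider, reason):
            spider.logger.info("Spider %s closed (%s): all links collected",
                               spider.name, reason)

Enable it in settings.py with something like `EXTENSIONS = {'myproject.extensions.SpiderClosedExtension': 500}`.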