It is easy to crawl an entire website:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        extractor = LinkExtractor(allow_domains='quotes.toscrape.com')
        links = extractor.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, self.parse)
        yield {'url': response.url}
But how can I return a single value: the total number of links?
Answer 0 (score: 0)
For statistics about your crawl, use Scrapy Stats. Inside a spider the stats collector is available as self.crawler.stats:

self.crawler.stats.inc_value('link_count')

The statistics will then be reported together with the rest of the spider stats.
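Putting the two pieces together, here is a minimal sketch of the spider above with the counter wired in. The stat key 'link_count' and the closed() log line are illustrative additions, not part of the original answer:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        extractor = LinkExtractor(allow_domains='quotes.toscrape.com')
        links = extractor.extract_links(response)
        # Add the number of links found on this page to one running total.
        self.crawler.stats.inc_value('link_count', count=len(links))
        for link in links:
            yield scrapy.Request(link.url, self.parse)
        yield {'url': response.url}

    def closed(self, reason):
        # Called once when the spider finishes; the stat now holds the total.
        self.logger.info('link_count: %s',
                         self.crawler.stats.get_value('link_count'))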
Statistics can be retrieved from a Scrapy Cloud project with the metadata API:
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient()
project = client.get_project(<PROJECT_ID>)
job = project.jobs.get(<JOB_ID>)
stats = job.metadata.get('scrapystats')
>>> job.metadata.get('scrapystats')
...
'downloader/response_count': 104,
'downloader/response_status_count/200': 104,
'finish_reason': 'finished',
'finish_time': 1447160494937,
'item_scraped_count': 50,
'log_count/DEBUG': 157,
'log_count/INFO': 1365,
'log_count/WARNING': 3,
'memusage/max': 182988800,
'memusage/startup': 62439424,
...
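With the custom counter from the sketch above, the total would appear in that dict under whatever key you incremented. A hypothetical read-out (assuming the 'link_count' key from earlier):

# 'link_count' is the illustrative key incremented in the spider sketch.
total_links = stats.get('link_count', 0)
print(total_links)

Note that ScrapinghubClient() called with no arguments reads the API key from the SH_APIKEY environment variable.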