I am trying to get the number of pages crawled by my spider using the 'pages_crawled' stat. However, no matter which website I try, I always get pages_crawled = None.
Here is my code:
from threading import Thread
from selenium import webdriver
from urlparse import urlparse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.settings import Settings
from scrapy.crawler import Crawler
from scrapy.http.request import Request
from scrapy.statscol import StatsCollector

class MySpider(CrawlSpider):
    name = "mySpider"

    def get_url():
        url = raw_input('Enter the url of your website (including the http)')
        return url

    start_url = str(get_url())
    extractor = SgmlLinkExtractor()
    rules = (Rule(extractor, callback='web_crawling', follow=True),)

    def web_crawling(self):
        settingObj = Settings()
        crawler = Crawler(settingObj)
        stat = StatsCollector(crawler)
        depth = stat.get_value('pages_crawled')
        return depth
Why do I keep getting None?
Thanks!
Answer 0 (score: 0)
First, change the line
start_url = str(get_url())
so that start_urls is a list:
start_urls = [str(get_url())]
Then change the line
def web_crawling(self):
by adding an extra parameter to the parse callback:
def web_crawling(self, response):
Here is a working version of your code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.settings import Settings
from scrapy.crawler import Crawler
from scrapy.statscol import StatsCollector

class MySpider(CrawlSpider):
    name = "mySpider"

    def get_url():
        url = raw_input('Enter the url of your website (including the http)')
        return url

    # start_urls must be a list, not a single string
    start_urls = [str(get_url())]
    extractor = SgmlLinkExtractor()
    rules = (Rule(extractor, callback='web_crawling', follow=True),)

    # Rule callbacks receive the response as their second argument
    def web_crawling(self, response):
        settingObj = Settings()
        crawler = Crawler(settingObj)
        stat = StatsCollector(crawler)
        depth = stat.get_value('pages_crawled')
        return depth
To test the above code, save it in a file named spidey.py. Then install Scrapy in a virtualenv and run it with the scrapy executable, like so:
virtualenv venv
venv/bin/pip install scrapy
venv/bin/scrapy runspider spidey.py
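Even with those fixes, pages_crawled will still come back as None: the callback creates a brand-new Crawler and StatsCollector that never collected anything, and 'pages_crawled' is not a stock Scrapy stat key in any case. A minimal sketch of the callback, assuming a Scrapy version that exposes the running crawler on the spider as self.crawler, and using 'response_received_count' as the closest built-in counter:

def web_crawling(self, response):
    # read from the stats collector of the crawler actually running this
    # spider, not from a freshly instantiated StatsCollector
    stats = self.crawler.stats
    # there is no built-in 'pages_crawled' key; 'response_received_count'
    # counts the responses the spider has received so far
    return stats.get_value('response_received_count')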
Answer 1 (score: 0)
It is better to add this as a Scrapy extension. Concretely, add an extensions.py at the root of your Scrapy project:
from scrapy import signals

class ExtensionThatAccessStats(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        spider.log("spider stats %s" % self.stats.get_stats())
From the spider log:
[dmoz] DEBUG: spider stats {'log_count/DEBUG': 52,
'scheduler/dequeued': 2, 'log_count/INFO': 3,
'downloader/response_count': 2, 'downloader/response_status_count/200': 2,
'response_received_count': 2, 'scheduler/enqueued/memory': 2,
'downloader/response_bytes': 14892, 'finish_reason': 'finished',
'start_time': datetime.datetime(2013, 12, 3, 17, 50, 41, 253140),
'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2,
'finish_time': datetime.datetime(2013, 12, 3, 17, 50, 41, 544793),
'downloader/request_bytes': 530, 'downloader/request_method_count/GET': 2,
'downloader/request_count': 2, 'item_scraped_count': 44}
To make the extension active, do not forget to add the following to your project's settings.py:

EXTENSIONS = {
    'project_name.extensions.ExtensionThatAccessStats': 100,
}
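If you only need a single number rather than the whole stats dict, the same stats object also exposes get_value(). Note that the log above contains no 'pages_crawled' key, which is why the question's lookups return None; a hedged variant of spider_closed using 'response_received_count' as the closest built-in equivalent:

def spider_closed(self, spider):
    # 'pages_crawled' is not a built-in stat key, so it would return None;
    # 'response_received_count' tracks how many responses were downloaded
    pages = self.stats.get_value('response_received_count', 0)
    spider.log("pages crawled: %s" % pages)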