I am trying to get the number of pages crawled by my spider using the 'pages_crawled' stat. However, no matter which website I try, I always get pages_crawled = None.
Here is my code:
from threading import Thread
from selenium import webdriver
from urlparse import urlparse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.settings import Settings
from scrapy.crawler import Crawler
from scrapy.http.request import Request
from scrapy.statscol import StatsCollector

class MySpider(CrawlSpider):
    name = "mySpider"

    def get_url():
        url = raw_input('Enter the url of your website (including the http)')
        return url

    start_url = str(get_url())
    extractor = SgmlLinkExtractor()
    rules = (Rule(extractor, callback='web_crawling', follow=True),)

    def web_crawling(self):
        settingObj = Settings()
        crawler = Crawler(settingObj)
        stat = StatsCollector(crawler)
        depth = stat.get_value('pages_crawled')
        return depth
Why do I keep getting None?
Thanks!
Answer 0 (score: 0)
First, change the line
start_url = str(get_url())
so that start_urls is a list:
start_urls = [str(get_url())]
Then change the line
def web_crawling(self):
by adding an extra parameter to the parse callback:
def web_crawling(self, response):
Here is a working version of your code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.settings import Settings
from scrapy.crawler import Crawler
from scrapy.statscol import StatsCollector

class MySpider(CrawlSpider):
    name = "mySpider"

    def get_url():
        url = raw_input('Enter the url of your website (including the http)')
        return url

    # start_urls must be a list, not a single string
    start_urls = [str(get_url())]
    extractor = SgmlLinkExtractor()
    rules = (Rule(extractor, callback='web_crawling', follow=True),)

    # Rule callbacks receive the response as their second argument
    def web_crawling(self, response):
        settingObj = Settings()
        crawler = Crawler(settingObj)
        stat = StatsCollector(crawler)
        depth = stat.get_value('pages_crawled')
        return depth
To test the above code, save it in a file named spidey.py. Then install Scrapy in a virtualenv and run it with the scrapy executable, like so:
virtualenv venv
venv/bin/pip install scrapy
venv/bin/scrapy runspider spidey.py
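Even with those fixes, pages_crawled will still come back as None: the callback creates a brand-new Crawler and StatsCollector that never collected anything, and 'pages_crawled' is not a stock Scrapy stat key in any case. A minimal sketch of the callback, assuming a Scrapy version that exposes the running crawler on the spider as self.crawler, and using 'response_received_count' as the closest built-in counter:

def web_crawling(self, response):
    # read from the stats collector of the crawler actually running this
    # spider, not from a freshly instantiated StatsCollector
    stats = self.crawler.stats
    # there is no built-in 'pages_crawled' key; 'response_received_count'
    # counts the responses the spider has received so far
    return stats.get_value('response_received_count')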
Answer 1 (score: 0)
It is better to add this as a Scrapy extension. Concretely, add an extensions.py at the root of your Scrapy project:
from scrapy import signals

class ExtensionThatAccessStats(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        spider.log("spider stats %s" % self.stats.get_stats())
From the spider log:
[dmoz] DEBUG: spider stats {'log_count/DEBUG': 52,
'scheduler/dequeued': 2, 'log_count/INFO': 3,
'downloader/response_count': 2, 'downloader/response_status_count/200': 2,
'response_received_count': 2, 'scheduler/enqueued/memory': 2,
'downloader/response_bytes': 14892, 'finish_reason': 'finished',
'start_time': datetime.datetime(2013, 12, 3, 17, 50, 41, 253140),
'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2,
'finish_time': datetime.datetime(2013, 12, 3, 17, 50, 41, 544793),
'downloader/request_bytes': 530, 'downloader/request_method_count/GET': 2,
'downloader/request_count': 2, 'item_scraped_count': 44}
To make the extension active, do not forget to add the following to your project's settings.py:

EXTENSIONS = {
    'project_name.extensions.ExtensionThatAccessStats': 100,
}
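If you only need a single number rather than the whole stats dict, the same stats object also exposes get_value(). Note that the log above contains no 'pages_crawled' key, which is why the question's lookups return None; a hedged variant of spider_closed using 'response_received_count' as the closest built-in equivalent:

def spider_closed(self, spider):
    # 'pages_crawled' is not a built-in stat key, so it would return None;
    # 'response_received_count' tracks how many responses were downloaded
    pages = self.stats.get_value('response_received_count', 0)
    spider.log("pages crawled: %s" % pages)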