scrapy InitSpider: set Rules in __init__?

Time: 2016-08-08 10:28:27

Tags: scrapy rules scrapy-spider

I am building a recursive web spider with an optional login. I want to make most of the settings dynamic via a JSON config file.

In my __init__ function, I read this file and try to populate all variables; however, this does not work with Rules.

    class CrawlpySpider(InitSpider):

        ...

        #----------------------------------------------------------------------
        def __init__(self, *args, **kwargs):
            """Constructor: overwrite parent __init__ function"""

            # Call parent init
            super(CrawlpySpider, self).__init__(*args, **kwargs)

            # Get command line arg provided configuration param
            config_file = kwargs.get('config')

            # Validate configuration file parameter
            if not config_file:
                logging.error('Missing argument "-a config"')
                logging.error('Usage: scrapy crawl crawlpy -a config=/path/to/config.json')
                self.abort = True

            # Check if it is actually a file
            elif not os.path.isfile(config_file):
                logging.error('Specified config file does not exist')
                logging.error('Not found in: "' + config_file + '"')
                self.abort = True

            # All good, read config
            else:
                # Load json config
                fpointer = open(config_file)
                data = fpointer.read()
                fpointer.close()

                # convert JSON to dict
                config = json.loads(data)

                # config['rules'] is simply a string array which looks like this:
                # config['rules'] = [
                #    'password',
                #    'reset',
                #    'delete',
                #    'disable',
                #    'drop',
                #    'logout',
                # ]

                CrawlpySpider.rules = (
                    Rule(
                        LinkExtractor(
                            allow_domains=(self.allowed_domains),
                            unique=True,
                            deny=tuple(config['rules'])
                        ),
                        callback='parse',
                        follow=False
                    ),
                )

Scrapy still crawls the pages listed in config['rules'] and therefore also hits the logout page. So the specified pages are not being denied. What am I missing here?
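The config-loading step described above can be sketched on its own, outside the spider. This is a minimal illustration, not the asker's code: the helper name load_deny_patterns is hypothetical, and treating a missing "rules" key as "deny nothing" is an assumption.

```python
import json
import os


def load_deny_patterns(config_file):
    """Return config['rules'] as a tuple, or None if the file is unusable.

    Hypothetical helper, not part of Scrapy or of the original spider.
    """
    if not config_file or not os.path.isfile(config_file):
        return None
    # The context manager closes the file even if JSON parsing raises.
    with open(config_file) as fpointer:
        config = json.load(fpointer)
    # Assumption: a missing "rules" key means there is nothing to deny.
    return tuple(config.get('rules', []))
```

Returning None for a missing or nonexistent file leaves the decision about aborting (self.abort in the question) to the caller.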

Update

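I have already tried setting CrawlpySpider.rules = ... as well as self.rules = ... inside __init__. Neither variant works.

  • Spider: InitSpider
  • Rules: LinkExtractor
  • Before crawl: logging in first

I even tried to deny these pages inside one of my spider's functions.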
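Filtering links by hand against the deny list might look roughly like the sketch below. It assumes each deny entry is a regular expression searched anywhere in the URL. The helper name is_denied and the example.com URLs are made up for illustration and are not taken from the question.

```python
import re


def is_denied(url, deny_patterns):
    """Return True if any pattern matches anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in deny_patterns)


# Same entries as config['rules'] in the question.
deny = ('password', 'reset', 'delete', 'disable', 'drop', 'logout')

# Hypothetical links, as they might be extracted from a page.
links = [
    'http://example.com/profile',
    'http://example.com/logout',
    'http://example.com/account/delete?id=1',
]

# Keep only the links that no deny pattern matches.
allowed = [link for link in links if not is_denied(link, deny)]
print(allowed)
```

A check like this can run on every extracted URL before a request is issued, independently of whether the Rule objects are picked up by the spider.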

1 Answer:

Answer 0 (score: 0)

You are setting a class attribute where you want to set an instance attribute:

# this:
CrawlpySpider.rules = (
# should be this:
self.rules = (
<...>
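The difference the answer points at can be seen in plain Python, without Scrapy. The two class names below are hypothetical. The sketch shows only the attribute semantics, not how Scrapy consumes rules.

```python
class SharedRules:
    rules = ()  # class attribute, shared by all instances

    def __init__(self, deny):
        # Assigning on the class overwrites the value for every instance.
        SharedRules.rules = tuple(deny)


class OwnRules:
    rules = ()  # class-level default

    def __init__(self, deny):
        # Assigning on self creates an attribute that belongs to this object only.
        self.rules = tuple(deny)


a = SharedRules(['logout'])
b = SharedRules(['delete'])  # also changes what a.rules resolves to

c = OwnRules(['logout'])
d = OwnRules(['delete'])     # c.rules is unaffected

print(a.rules, b.rules)
print(c.rules, d.rules, OwnRules.rules)
```

With the class-level assignment, the last constructed object decides the rules for all instances. With self.rules, each instance keeps its own tuple and the class default stays empty.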