Question

I am building a recursive web spider with an optional login. I want to make most settings dynamic via a JSON config file.

In my __init__ function, I read this file and try to populate all variables; however, this does not work with Rules.

    class CrawlpySpider(InitSpider):

        ...

        #----------------------------------------------------------------------
        def __init__(self, *args, **kwargs):
            """Constructor: overwrite parent __init__ function"""

            # Call parent init
            super(CrawlpySpider, self).__init__(*args, **kwargs)

            # Get command line arg provided configuration param
            config_file = kwargs.get('config')

            # Validate configuration file parameter
            if not config_file:
                logging.error('Missing argument "-a config"')
                logging.error('Usage: scrapy crawl crawlpy -a config=/path/to/config.json')
                self.abort = True

            # Check if it is actually a file
            elif not os.path.isfile(config_file):
                logging.error('Specified config file does not exist')
                logging.error('Not found in: "' + config_file + '"')
                self.abort = True

            # All good, read config
            else:
                # Load json config
                fpointer = open(config_file)
                data = fpointer.read()
                fpointer.close()

                # convert JSON to dict
                config = json.loads(data)

                # config['rules'] is simply a string array which looks like this:
                # config['rules'] = [
                #    'password',
                #    'reset',
                #    'delete',
                #    'disable',
                #    'drop',
                #    'logout',
                # ]

                CrawlpySpider.rules = (
                    Rule(
                        LinkExtractor(
                            allow_domains=(self.allowed_domains),
                            unique=True,
                            deny=tuple(config['rules'])
                        ),
                        callback='parse',
                        follow=False
                    ),
                )

Scrapy still crawls the pages that match config['rules'] and therefore also hits the logout page. So the specified pages are not being denied. What am I missing here?
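For reference, the deny argument of LinkExtractor takes regular expressions that are searched against the absolute URL, so the strings in config['rules'] should exclude a logout URL once the rule is actually applied. The following is only a plain-Python approximation of that matching, not Scrapy's own code; the helper is_denied and the example.com URLs are hypothetical.

```python
import re


def is_denied(url, deny_patterns):
    """Return True if any deny pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in deny_patterns)


# Same strings as the config['rules'] list shown in the question
deny = ('password', 'reset', 'delete', 'disable', 'drop', 'logout')

# Hypothetical URLs, used only for illustration
logout_denied = is_denied('http://example.com/user/logout', deny)
index_denied = is_denied('http://example.com/index.html', deny)
```

If a check like this reports the logout URL as denied, the patterns themselves are probably fine and the problem is more likely in how the rules are attached to the spider.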

Update

I have already tried setting CrawlpySpider.rules = ... as well as self.rules = ... inside __init__. Neither variant works.

Spider: InitSpider
Rules: LinkExtractor
Before crawl: logging in prior to crawling

I even tried to deny these pages in my parse function.

Answer 1

You are setting a class attribute where you want to set an instance attribute:

# this:
CrawlpySpider.rules = (
# should be this:
self.rules = (
<...>
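To illustrate the distinction in isolation, here is a minimal plain-Python sketch. DemoSpider and its use_class_attribute flag are hypothetical stand-ins, not part of Scrapy; they only show how assigning through the class name differs from assigning through self.

```python
class DemoSpider:
    """Hypothetical stand-in for a spider class; not part of Scrapy."""

    rules = ()  # class-level default, shared by all instances

    def __init__(self, deny, use_class_attribute):
        if use_class_attribute:
            # Rebinds the attribute on the class itself, so every instance
            # without its own 'rules' attribute sees the new value
            DemoSpider.rules = tuple(deny)
        else:
            # Creates an attribute that belongs to this instance only
            self.rules = tuple(deny)


# Instance attributes: each spider keeps its own rules, the class is untouched
a = DemoSpider(['logout'], use_class_attribute=False)
b = DemoSpider(['reset'], use_class_attribute=False)

# Class attribute: the shared default changes, 'c' has no rules of its own
c = DemoSpider(['drop'], use_class_attribute=True)
```

After running this, a.rules and b.rules hold their own tuples, while c.rules is resolved through the class. One caveat worth checking against your Scrapy version: as far as I know, CrawlSpider compiles rules inside its own __init__, so an instance attribute assigned after the super().__init__() call may not be picked up unless it is set before that call.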

Scrapy InitSpider: set Rules in __init__?

1 answer
