I have a class in which some code runs before `__init__`:
class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)),
             callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams
I run this Scrapy code with the following command:
> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt
Now I would like to be able to configure the static variable named rules from the command line:
> scrapy runspider my_spider.py -a crawl_pages=True -a moreparams="more parameters" -o output.txt
I changed `__init__` to:
def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
    if (crawl_pages is True):
        self.rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )
    self.moreparams = moreparams
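However, if I reassign the static variable rules inside `__init__`, Scrapy no longer takes it into account: the spider runs, but it only crawls the given start_urls instead of the whole domain. It seems that rules has to be a static class variable.
So, how can I set a static variable dynamically?
Answer 0 (score: 6)
So here is how I solved the problem with the help of @Not_a_Golfer and @nramirezuy; I simply combined the two points they suggested: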
class NoFollowSpider(CrawlSpider):
    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        # Set the class member from here
        if (crawl_pages is True):
            NoFollowSpider.rules = (
                Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
            )
            # Then recompile the Rules
            super(NoFollowSpider, self)._compile_rules()
        # Keep going as before
        self.moreparams = moreparams
Thanks everyone for your help!
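One caveat about the `crawl_pages is True` check: values passed with `-a` reach `__init__` as strings, so `-a crawl_pages=True` delivers the string "True", and an identity comparison against `True` fails. A minimal sketch of a conversion helper follows. The name `to_bool` and the list of accepted spellings are assumptions made for illustration; they are not part of Scrapy.

```python
def to_bool(value, default=False):
    """Interpret a command-line style value as a boolean.

    Hypothetical helper: the accepted spellings below are an assumption.
    """
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes", "on")


# What `-a crawl_pages=True` would hand to __init__ is a string.
arg = "True"
print(arg is True)     # False: a string is never the True object
print(to_bool(arg))    # True
```

Inside `__init__`, the check could then read `if to_bool(crawl_pages):` instead of `if (crawl_pages is True):`.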
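Answer 1 (score: 2)
Well, you have two options. The simpler one, which I'm not sure will work, is to use the class instead of self in the constructor to set the rules: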
def __init__(self, session_id=-1, crawl_pages=False, allowed_domains=None, start_urls=None, xpath=None, contains=None, doesnotcontain=None, *args, **kwargs):
    # You simply set the class member from here
    NoFollowSpider.rules = (
        Rule(SgmlLinkExtractor(allow=("",)),
             callback="parse_items", follow=True),
    )
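I'm not sure whether Scrapy will respect it; that depends on when it reads these rules. But it's worth a try.
The other, more complicated approach is to use a metaclass. Basically, you can intervene in how the class itself is created, not just its instances. Note that the metaclass's `__new__` method runs at import time, before any of your other code runs.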
class MyType(type):
    """
    A Meta class that creates classes
    """
    @staticmethod
    def __new__(cls, name, bases, dict):
        ret = type.__new__(cls, name, bases, dict)
        # whatever you want to do - do it here. You can peek into
        # the command line args for example
        ret.rules = (....)
        return ret


class MyClass(object):
    """
    Now comes the actual class, with the __metaclass__ identifier.
    This means that when we create the class definition we call the metaclass' __new__
    """
    __metaclass__ = MyType

    def __init__(self):
        pass
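The snippet above uses the Python 2 `__metaclass__` attribute. For readers on Python 3, here is a rough, self-contained sketch of the same idea using the `metaclass=` keyword. The environment variable `CRAWL_PAGES`, the names `RulesMeta` and `MySpider`, and the placeholder rule string are all assumptions for illustration; a real spider would build `Rule` objects instead.

```python
import os


class RulesMeta(type):
    """Metaclass that fills in `rules` while the class object is created."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Assumption: configuration is read from an environment variable
        # with the hypothetical name CRAWL_PAGES.
        if os.environ.get("CRAWL_PAGES") == "1":
            cls.rules = ("follow-all-links",)  # placeholder for Rule objects
        else:
            cls.rules = ()
        return cls


# Must be set before the class statement below is executed.
os.environ["CRAWL_PAGES"] = "1"


class MySpider(metaclass=RulesMeta):
    pass


print(MySpider.rules)  # ('follow-all-links',)
```

The point of the sketch is the timing: `RulesMeta.__new__` runs when the `class MySpider` statement executes, so whatever it reads must already be available at import time.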
Answer 2 (score: 1)
The rules are compiled (in the parent's `__init__`) before you define them.
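To make the ordering problem concrete, here is a small model that does not use Scrapy at all. `FakeCrawlSpider` is a hypothetical stand-in that, like the parent class described here, takes a snapshot of `rules` once during `__init__`; the assumption is that the real parent behaves similarly.

```python
class FakeCrawlSpider:
    """Stand-in for the parent class: snapshots `rules` once, in __init__."""

    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # Copy whatever `rules` is visible at this moment.
        self._rules = list(self.rules)


class TooLate(FakeCrawlSpider):
    def __init__(self):
        super().__init__()        # rules are compiled here, still empty
        self.rules = ("rule",)    # assigned after compilation: ignored


class InTime(FakeCrawlSpider):
    def __init__(self):
        self.rules = ("rule",)    # assigned first
        super().__init__()        # compilation now sees the rule


print(TooLate()._rules)  # []
print(InTime()._rules)   # ['rule']
```

In this model, either assigning `rules` before calling the parent's `__init__` or recompiling afterwards gives the expected result.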
Answer 3 (score: 1)
class NoFollowSpider(CrawlSpider):
    def __init__(self, crawl_pages=False, moreparams=None, *a, **kw):
        if (crawl_pages is True):
            NoFollowSpider.rules = (
                Rule(SgmlLinkExtractor(allow=("",)),
                     callback="parse_items", follow=True),
            )
        # No need to call "_compile_rules()" manually, it's called in __init__ of the parent
        super(NoFollowSpider, self).__init__(*a, **kw)
        # Keep going as before
        self.moreparams = moreparams
Answer 4 (score: 0)
How do I set a static variable dynamically?
I don't know Scrapy, but is there any reason you can't use a class method?
class NoFollowSpider(CrawlSpider):
    rules = (Rule(SgmlLinkExtractor(allow=("",)),
                  callback="parse_items", follow=True),)

    @classmethod
    def set_rules(klass, rules):
        klass.rules = rules
Note that rules is not a static variable but a class attribute.
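The distinction matters because assigning through an instance creates a separate instance attribute that shadows the class attribute, while assigning through the class changes what every instance sees. A minimal sketch in plain Python (the class name `Spider` and the rule strings are placeholders, not Scrapy objects):

```python
class Spider:
    rules = ("default",)

    @classmethod
    def set_rules(cls, rules):
        cls.rules = rules


a = Spider()
b = Spider()

a.rules = ("only-on-a",)        # instance attribute, shadows the class one
print(b.rules)                  # ('default',)

Spider.set_rules(("shared",))   # rebinds the class attribute
print(b.rules)                  # ('shared',)
print(a.rules)                  # ('only-on-a',)
```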
Edit: here is another approach that may set it right at the start. It should let you avoid calling _compile_rules(), and I find it cleaner:
class NoFollowSpider(CrawlSpider):
    def __new__(klass, crawl_pages=False, moreparams=None, *args, **kwargs):
        if crawl_pages:
            klass.rules = (Rule(SgmlLinkExtractor(allow=("",)),
                                callback="parse_items", follow=True),)
        return super(NoFollowSpider, klass).__new__(klass, *args, **kwargs)

    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams
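The `__new__` variant relies on `__new__` running before `__init__` with the same arguments. A rough model without Scrapy is sketched below; `Base` and `Configurable` are hypothetical names, and `Base.__init__` merely imitates a parent that reads `rules` during initialisation.

```python
class Base:
    rules = ()

    def __init__(self):
        # Stand-in for rule compilation in the parent class.
        self.compiled = list(self.rules)


class Configurable(Base):
    def __new__(cls, crawl_pages=False):
        if crawl_pages:
            cls.rules = ("follow",)
        # object.__new__ takes no extra arguments here, so pass only cls.
        return super().__new__(cls)

    def __init__(self, crawl_pages=False):
        super().__init__()


print(Configurable(crawl_pages=True).compiled)  # ['follow']
```

One design consequence to keep in mind: the class itself is modified, so instances created later without `crawl_pages` still see the rule.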
Answer 5 (score: 0)
I did this with Scrapy 1.0 and it works. Note that you can only rely on kwargs at the initial Spider instantiation.
class LinuxFoundationSpider(CrawlSpider):
    year = None

    def __init__(self, category=None, *args, **kwargs):
        monthly_thread_xpath = r'date\.html'
        if kwargs.get('year'):
            LinuxFoundationSpider.year = kwargs['year']
        if LinuxFoundationSpider.year:
            monthly_thread_xpath = '%s.*?(\\/date\\.html)' % LinuxFoundationSpider.year
        LinuxFoundationSpider.rules = (
            Rule(LinkExtractor(allow=(monthly_thread_xpath,))),
            Rule(LinkExtractor(restrict_xpaths=('//ul[2]/li/a[1]',)),
                 callback='parse_entry', follow=False),
        )
        super(LinuxFoundationSpider, self).__init__(*args, **kwargs)
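To see what the year-dependent pattern looks like once the argument is substituted, the string formatting can be tried on its own. The URLs below are made up for illustration, and the sketch assumes the pattern is applied to each URL with search semantics.

```python
import re

year = "2015"  # what `-a year=2015` would deliver (a string)
pattern = '%s.*?(\\/date\\.html)' % year
print(pattern)  # 2015.*?(\/date\.html)

# Hypothetical URLs, invented for this example only.
match_url = "http://example.org/archive/2015-March/date.html"
other_url = "http://example.org/archive/2014-March/date.html"

print(bool(re.search(pattern, match_url)))  # True
print(bool(re.search(pattern, other_url)))  # False
```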