Passing arguments to Scrapy rules from the command line, or modifying the rules dynamically

Date: 2017-03-26 12:47:02

Tags: python scrapy

I'm new to Python programming and I'm struggling to get a Scrapy crawl script to work; I could use your advice. I have a working Scrapy script that walks a given URL and extracts its links, and I want it to work for any URL supplied at run time. So I started passing the start URL and the allowed domain to Scrapy on the command line, like this:

scrapy crawl myCrawler -o test.json -t json -a allowedDomains="xxx" -a startUrls="xxx" -a allowedPaths="xxx"

However, it doesn't work: the rules don't seem to pick up the values from the arguments. With my limited Python skills I can't figure out how to fix this; I'd appreciate any help.
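For reference, Scrapy does hand every -a key=value pair to the spider's __init__ as a keyword argument. A quick way to confirm the values actually arrive (a debugging sketch of my own, not part of the original post; the spider name is made up) is to log them in the constructor:

from scrapy.spiders import CrawlSpider

class ArgEchoSpider(CrawlSpider):
    # Hypothetical throwaway spider that just echoes its command-line arguments.
    name = "argEcho"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        super(ArgEchoSpider, self).__init__(*args, **kwargs)
        # Each -a key=value from the command line shows up here as a kwarg.
        self.logger.info("allowedDomains=%s startUrls=%s allowedPaths=%s",
                         allowedDomains, startUrls, allowedPaths)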

Here is the code snippet:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    name = "myCrawler"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        self.allowedDomains = allowedDomains
        self.startUrls = startUrls
        self.allowedPaths = allowedPaths
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]

    # This is the broken part: the class body runs before __init__,
    # so allowedPaths and allowedDomains are undefined names here.
    rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains),
                  callback="parse_items", follow=True),)
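The root cause, as I understand it (not spelled out in the original post): Python executes the class body, including the rules = (...) line, once when the class is defined, before any __init__ ever runs, so allowedPaths and allowedDomains simply do not exist as names at that point. A minimal sketch of the same failure mode, with hypothetical names:

class Demo:
    def __init__(self, path=''):
        self.path = path  # bound per instance, only when __init__ runs

    # Uncommenting the next line raises NameError as soon as the class is
    # defined, because 'path' only ever exists inside __init__:
    # rules = (path,)

print(Demo('docs').path)  # prints 'docs' -- the attribute exists only after __init__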

1 answer:

Answer 0 (score: 0)

Luckily, I found the answer in How to dynamically set Scrapy rules?

Here is the working code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    name = "myCrawler"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        self.allowedDomains = allowedDomains
        self.startUrls = startUrls
        self.allowedPaths = allowedPaths
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]
        # Build the rules inside __init__, where the command-line arguments
        # are in scope, and attach them to the class...
        DmozSpider.rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains),
                                 callback="parse_items", follow=True),)
        # ...then recompile, because CrawlSpider already compiled the (then
        # empty) rules during the super().__init__ call above.
        super(DmozSpider, self)._compile_rules()
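Two notes on this fix. First, _compile_rules() is an underscore-prefixed internal of CrawlSpider, so the approach leans on Scrapy internals and could break across versions. Second, if you need several domains or paths, one possible refinement (my own sketch, not part of the original answer; the spider name and the parse_items body are illustrative) is to split comma-separated command-line values:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MultiDomainSpider(CrawlSpider):
    name = "myCrawlerMulti"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        super(MultiDomainSpider, self).__init__(*args, **kwargs)
        # Accept comma-separated values, e.g.
        #   scrapy crawl myCrawlerMulti -a allowedDomains="a.com,b.com" -a startUrls="https://a.com"
        self.allowed_domains = allowedDomains.split(',')
        self.start_urls = startUrls.split(',')
        MultiDomainSpider.rules = (
            Rule(LinkExtractor(allow=allowedPaths.split(','),
                               allow_domains=self.allowed_domains),
                 callback="parse_items", follow=True),
        )
        super(MultiDomainSpider, self)._compile_rules()

    def parse_items(self, response):
        # Illustrative callback: emit each crawled URL with its <title>.
        yield {"url": response.url, "title": response.css("title::text").extract_first()}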