I'm new to Python programming and I'm having a hard time getting a Python crawling script to work; I need your tips to fix it. I have a working Scrapy script that walks a given URL and extracts the links. I want to make it work for any dynamically given URL, so I started passing the start URL and domain to Scrapy on the command line, like this:
scrapy crawl myCrawler -o test.json -t json -a allowedDomains="xxx" -a startUrls="xxx" -a allowedPaths="xxx"
However, it doesn't work: it looks like the rules never pick up the values from the arguments. With my limited Python skills I can't figure out how to fix this. Could someone please help?
Here is the code segment:
class DmozSpider(CrawlSpider):
    name = "myCrawler"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        self.allowedDomains = allowedDomains
        self.startUrls = startUrls
        self.allowedPaths = allowedPaths
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]
        rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains),
                      callback="parse_items", follow=True),)
Answer:
Luckily, I found the answer at How to dynamically set Scrapy rules? The catch is that the rules assignment inside __init__ only binds a local variable, and CrawlSpider compiles its rules while its own __init__ runs, so the rules have to be set on the spider class and then recompiled. Here is the working code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    name = "myCrawler"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        self.allowedDomains = allowedDomains
        self.startUrls = startUrls
        self.allowedPaths = allowedPaths
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]
        # super().__init__() has already compiled the (empty) class-level
        # rules, so set the rules on the class and recompile them.
        DmozSpider.rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains),
                                 callback="parse_items", follow=True),)
        super(DmozSpider, self)._compile_rules()
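
As a side note, an equivalent pattern (a minimal sketch, assuming Scrapy's CrawlSpider, whose __init__ compiles whatever self.rules holds at that point) is to assign the rules before calling the parent constructor, which avoids the explicit _compile_rules() call. The parse_items stub here is a placeholder, not code from the original post:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    name = "myCrawler"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]
        # Assign the rules *before* super().__init__() runs, so that
        # CrawlSpider compiles them as part of its own initialisation.
        self.rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains),
                           callback="parse_items", follow=True),)
        super(DmozSpider, self).__init__(*args, **kwargs)

    def parse_items(self, response):
        # Placeholder callback; the real link-handling logic goes here.
        pass

Either way, the spider is started exactly as in the command line at the top, with allowedDomains, startUrls and allowedPaths supplied through -a, which Scrapy forwards to __init__ as keyword arguments.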