I followed the advice in these two posts, since I am also trying to create a generic scrapy spider:
How to pass a user defined argument in scrapy spider
Creating a generic scrapy spider
But I get an error saying that the variable I am supposed to pass in as an argument is not defined. Am I missing something in the __init__ method?
Code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from data.items import DataItem

class companySpider(BaseSpider):
    name = "woz"

    def __init__(self, domains=""):
        '''
        domains is a string
        '''
        self.domains = domains

    deny_domains = [""]
    start_urls = [domains]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        items = []
        for site in sites:
            item = DataItem()
            item['text'] = site.select('text()').extract()
            items.append(item)
        return items
This is my command line:
scrapy crawl woz -a domains="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
This is the error:
NameError: name 'domains' is not defined
Answer (score 5):
You should call super(companySpider, self).__init__(*args, **kwargs) at the beginning of your __init__:
def __init__(self, domains="", *args, **kwargs):
    super(companySpider, self).__init__(*args, **kwargs)
    self.domains = domains
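With that change, a minimal sketch of the question's spider could look like the one below. It assumes start_urls is also set inside __init__: the class-level start_urls = [domains] in the question is what raises the NameError, because domains only exists as a parameter of __init__, not in the class body.

from scrapy.spider import BaseSpider

class companySpider(BaseSpider):
    name = "woz"
    deny_domains = [""]

    def __init__(self, domains="", *args, **kwargs):
        super(companySpider, self).__init__(*args, **kwargs)
        self.domains = domains
        # build start_urls from the spider argument here, not in the class body
        self.start_urls = [domains]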
If your first requests depend on spider arguments, I usually just override the start_requests() method instead of __init__(). The argument names passed on the command line are available as attributes on the spider:
from scrapy.http import Request  # needed for start_requests()

class companySpider(BaseSpider):
    name = "woz"
    deny_domains = [""]

    def start_requests(self):
        yield Request(self.domains)  # for example, if domains is a single URL

    def parse(self, response):
        ...
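For completeness, here is a sketch of the whole spider using the start_requests() approach, keeping the parse() body from the question; it assumes Request is imported from scrapy.http, as above.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from data.items import DataItem

class companySpider(BaseSpider):
    name = "woz"
    deny_domains = [""]

    def start_requests(self):
        # the -a domains=... command-line argument is exposed as self.domains
        yield Request(self.domains)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for site in hxs.select('/html'):
            item = DataItem()
            item['text'] = site.select('text()').extract()
            items.append(item)
        return items

It is run with the same command line as in the question:
scrapy crawl woz -a domains="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"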