I'm new to Python. I want to create my own instance variables variable_1 and variable_2 in a Scrapy spider class. The following code runs fine:
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    url = 'url example'      # this class variable works fine
    variable_1 = 'info_1'    # this class variable works fine
    variable_2 = 'info_2'    # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()
But I want these to be instance variables, so that I don't have to edit the values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the spider and run it with SpiderTest1(url, variable_1, variable_2). Below is the new code, which I expected to behave like the code above, but it doesn't work:
class SpiderTest1(scrapy.Spider):
    name = 'main run'

    # the following __init__ is the new change, but it is not working
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# input values for the variables
url = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # it seems this call doesn't work
process.start()
Result:
TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'
I'd appreciate it if anyone could tell me how to achieve this.
Answer 0 (score: 0)
According to Common Practices and the API documentation, you should call the crawl method like this to pass arguments to the spider constructor:
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
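This works because crawl() forwards any extra arguments to the spider's constructor when Scrapy instantiates the spider itself, via the spider class's from_crawler() classmethod. The following is a minimal sketch, without Scrapy, of why passing an already-built instance fails while passing the class plus arguments succeeds; the names here are simplified stand-ins for Scrapy's internals, not its actual source:

```python
# Simplified stand-in for Scrapy's spider-construction machinery
# (illustrative only, not Scrapy's real implementation).

class Spider:
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Scrapy builds the spider itself from the *class*,
        # forwarding the extra arguments given to crawl().
        return cls(*args, **kwargs)

class SpiderTest1(Spider):
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

def crawl(crawler_or_spidercls, *args, **kwargs):
    # simplified: real Scrapy also accepts Crawler objects and name strings
    return crawler_or_spidercls.from_crawler('fake crawler', *args, **kwargs)

# Passing an already-built instance: from_crawler is a classmethod,
# so the class gets instantiated *again* -- this time with no arguments:
try:
    crawl(SpiderTest1('url example', 'info_1', 'info_2'))
except TypeError as e:
    print(type(e).__name__)  # TypeError: missing positional arguments

# Passing the class plus the arguments works:
spider = crawl(SpiderTest1, 'url example', 'info_1', 'info_2')
print(spider.variable_1)  # info_1
```

In other words, the instance you construct yourself is discarded as a construction recipe; only the class and the arguments you hand to crawl() matter.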
Update: the documentation also mentions this form of running a spider:
process.crawl('followall', domain='scrapinghub.com')
In this case, 'followall' is the name of a spider in the project (i.e., the value of the spider class's name attribute). In your particular case, where you define the spider as follows:
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...
you would run the spider by its name like this:
process = CrawlerProcess(get_project_settings())
process.crawl('main run', url, variable_1, variable_2)
process.start()
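Under the hood, Scrapy's spider loader resolves a name string to the spider class whose name attribute matches. A rough sketch of the idea follows; the registry here is hypothetical and only illustrates the lookup, not Scrapy's actual implementation:

```python
# Illustrative sketch of resolving a spider name string to its class.

class Spider:
    name = None

class SpiderTest1(Spider):
    name = 'main run'

class FollowAllSpider(Spider):
    name = 'followall'

# hypothetical loader: index spider classes by their `name` attribute
SPIDERS = {cls.name: cls for cls in (SpiderTest1, FollowAllSpider)}

def resolve(crawler_or_spidercls):
    """Accept either a spider class or its name string."""
    if isinstance(crawler_or_spidercls, str):
        return SPIDERS[crawler_or_spidercls]
    return crawler_or_spidercls

print(resolve('main run').__name__)    # SpiderTest1
print(resolve(SpiderTest1).__name__)   # SpiderTest1
```

Both call forms end up at the same class, which is why crawl() accepts either.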
Answer 1 (score: 0)
Thanks, my code works now. But I notice something slightly different from Common Practices. This is our code:
process.crawl(SpiderTest1, url, variable_1, variable_2)
And this is from Common Practices:
process.crawl('followall', domain='scrapinghub.com')
The first suggestion uses the class name SpiderTest1, while the other uses the string 'followall'. What does 'followall' refer to? Does it refer to the file testspiders/testspiders/spiders/followall.py, or just to the name = 'followall' attribute inside followall.py?
I'm asking because I'm still confused about when I should invoke a Scrapy spider by string and when by class name.
Thanks.