Creating class instance variables in a scrapy spider

Asked: 2019-03-08 16:29:16

Tags: python scrapy

I am new to Python. I want to create my own class instance variables variable_1, variable_2 in a scrapy spider class. The following code runs fine.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderTest1(scrapy.Spider):

    name       = 'main run'
    url        = 'url example'  # this class variable works fine
    variable_1 = 'info_1'       # this class variable works fine
    variable_2 = 'info_2'       # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')


# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1)
process.start()

But I want to turn these into class instance variables, so that I don't have to edit their values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the scrapy spider, and I expected to be able to run it with SpiderTest1(url, variable_1, variable_2). Below is the new code, which I hoped would behave like the code above, but it does not work:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderTest1(scrapy.Spider):

    name = 'main run'

    # the following __init__ is the new change, but it does not work
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')


# input values for the variables
url        = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # this call does not work
process.start()

Result:

TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'
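(Editor's note: the error occurs because `crawl` expects the spider *class* (or its name), not an instance; Scrapy constructs the spider itself and forwards any extra arguments to `__init__`. The toy stand-ins below illustrate that argument-forwarding pattern; they are not Scrapy's actual code.)

```python
# Toy stand-ins (NOT Scrapy's real classes) showing why crawl()
# needs the class plus constructor arguments, not a ready-made instance.

class ToySpider:
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

class ToyProcess:
    def crawl(self, spidercls, *args, **kwargs):
        # The framework instantiates the spider internally,
        # forwarding whatever arguments the caller supplied.
        return spidercls(*args, **kwargs)

process = ToyProcess()
spider = process.crawl(ToySpider, 'url example', 'info_1', 'info_2')
print(spider.variable_1)  # info_1
```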

I would be grateful if anyone can tell me how to achieve this.

2 Answers:

Answer 0 (score: 0)

According to Common Practices and the API documentation, you should call the crawl method like this to pass arguments to the spider constructor:

process = CrawlerProcess(get_project_settings())   
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
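A related detail: `scrapy.Spider.__init__` copies its keyword arguments onto the instance, so if your custom `__init__` calls `super().__init__(**kwargs)`, or if you drop the custom `__init__` entirely and pass keyword arguments to `crawl`, the values become instance attributes automatically. A minimal sketch with a stand-in base class (the `BaseSpider` here only mimics that one behaviour of the real `scrapy.Spider`):

```python
# Stand-in for scrapy.Spider: the real base class copies keyword
# arguments onto the instance in its __init__.
class BaseSpider:
    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        self.__dict__.update(kwargs)

class SpiderTest1(BaseSpider):
    name = 'main run'
    # no custom __init__ needed: keyword arguments become attributes

spider = SpiderTest1(url='url example', variable_1='info_1', variable_2='info_2')
print(spider.variable_1)  # info_1
```

With the real library this corresponds to calling `process.crawl(SpiderTest1, url=url, variable_1=variable_1, variable_2=variable_2)`.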

Update: The documentation also mentions this form of running a spider:

process.crawl('followall', domain='scrapinghub.com')

In this case, 'followall' is the name of a spider in the project (i.e. the value of the spider class's name attribute). In your particular case, where you define the spider as follows:

class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...

you would run the spider by its name with this code:

process = CrawlerProcess(get_project_settings())   
process.crawl('main run', url, variable_1, variable_2)
process.start()
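When `crawl` is given a string instead of a class, Scrapy's spider loader resolves it against the `name` attribute of the spider classes found in the project's `SPIDER_MODULES` packages. A rough sketch of that name-to-class mapping (a simplified toy, not the real loader):

```python
# Toy spider classes standing in for spiders discovered in a project.
class SpiderA:
    name = 'main run'

class SpiderB:
    name = 'followall'

# A toy registry keyed by each spider's name attribute, mimicking
# (in simplified form) how a loader could map names to classes.
registry = {cls.name: cls for cls in (SpiderA, SpiderB)}

spidercls = registry['followall']
print(spidercls is SpiderB)  # True
```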

Answer 1 (score: 0)

Thanks, my code works now. But I found something slightly different from Common Practices.

This is our code:

process.crawl(SpiderTest1, url, variable_1, variable_2)

And this is from Common Practices:

process.crawl('followall', domain='scrapinghub.com')

The first form passes the class SpiderTest1 as the first argument, while the other passes the string 'followall'.

What does 'followall' refer to? Does it refer to the module testspiders/testspiders/spiders/followall.py, or only to the class variable name = 'followall' inside followall.py?

I am asking because I am still confused about when I should call a scrapy spider by its string name and when by its class.

Thank you.