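How to pass arguments to process.crawl in Python Scrapy

Asked: 2018-06-21 10:27:52

Tags: python scrapy

I am trying to use Python's Scrapy library with IBM Cloud Functions. I would like to pass some arguments through process.crawl. How can I do that?

My code looks like this: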

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def __init__(self, make=None, *args, **kwargs):
        # "make" is the argument I want to receive from process.crawl
        super(MySpider, self).__init__(*args, **kwargs)
        init_url = "http://quotes.toscrape.com/"
        self.start_urls = [init_url]

    def parse(self, response):
        title = response.css(".header-box > div a::text").extract_first()
        yield {"title": title}


process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})

process.crawl(MySpider)  # <-- see "Explanation" below
process.start()

Explanation

I found here that this can be done as follows:

process.crawl(MySpider, make="Audi")

But when I try this, my editor reports an error:

expected type 'dict' got 'str' instead

What am I doing wrong?
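
For reference, Scrapy's documentation describes the extra positional and keyword arguments of crawl() as "arguments to initialize the spider", so make="Audi" should arrive as the make parameter of __init__. The editor message is probably a static type-inference warning rather than a runtime error (an assumption; nothing above confirms it). A minimal sketch of that forwarding, using hypothetical stand-ins (FakeSpider, crawl) instead of Scrapy itself:

```python
class FakeSpider:
    """Hypothetical stand-in for a scrapy.Spider subclass (no Scrapy needed)."""

    def __init__(self, make=None, *args, **kwargs):
        self.make = make      # the named argument we care about
        self.extra = kwargs   # any other keyword arguments


def crawl(spidercls, *args, **kwargs):
    """Simplified stand-in: forwards everything to the spider constructor.

    This is NOT Scrapy's implementation; it only mirrors the documented
    behaviour that extra arguments are used to initialize the spider.
    """
    return spidercls(*args, **kwargs)


spider = crawl(FakeSpider, make="Audi", year="2018")
print(spider.make)   # Audi
print(spider.extra)  # {'year': '2018'}
```

With the real library the call keeps the same shape: process.crawl(MySpider, make="Audi").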

Update

I am using the Scrapy spider with IBM Cloud Functions, so my code looks like this:

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def __init__(self, make=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        print("Make {}".format(make))

    def parse(self, response):
        title = response.css(".header-box > div a::text").extract_first()
        yield {"title": title}


def main(params):
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})

    process.crawl(MySpider, make="Audi")  # my editor warns here: expected type 'dict' got 'str' instead
    process.start()
    return {"joke": "Some shit joke"}

When I run main({}) from the console, I get the following error:

2018-06-22 08:42:45 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
Traceback (most recent call last):
  File "", line 1
  File "./main.py", line 30, in main
  File "/Users/boris/Projects/IBM-cloud/virtualenv/lib/python3.6/site-packages/scrapy/crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/boris/Projects/IBM-cloud/virtualenv/lib/python3.6/site-packages/twisted/internet/base.py", line 1260, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/boris/Projects/IBM-cloud/virtualenv/lib/python3.6/site-packages/twisted/internet/base.py", line 1240, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/boris/Projects/IBM-cloud/virtualenv/lib/python3.6/site-packages/twisted/internet/base.py", line 748, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
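
ReactorNotRestartable is what Twisted raises when its reactor is started again after it has already run and stopped in the same Python process. That would fit main({}) being called more than once in one console session, though that is an assumption about how it was run. One common workaround is to run every crawl in a fresh interpreter. The sketch below assumes the environment allows spawning a subprocess and has Scrapy installed for the child; run_in_fresh_interpreter and CRAWL_SCRIPT are hypothetical names, and the child script only echoes make back as JSON instead of returning scraped items.

```python
import json
import subprocess
import sys

# Hypothetical child script: one crawl per interpreter, so the Twisted
# reactor is started exactly once in each process.
CRAWL_SCRIPT = """
import json, sys
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def __init__(self, make=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.make = make

    def parse(self, response):
        yield {"title": response.css(".header-box > div a::text").extract_first()}

params = json.loads(sys.argv[1])
process = CrawlerProcess({"LOG_ENABLED": False})
process.crawl(MySpider, make=params.get("make"))
process.start()
print(json.dumps({"make": params.get("make")}))
"""


def run_in_fresh_interpreter(script, params):
    """Run `script` in a new Python process; parse its last stdout line as JSON."""
    completed = subprocess.run(
        [sys.executable, "-c", script, json.dumps(params)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return json.loads(completed.stdout.strip().splitlines()[-1])


def main(params):
    # Safe to call repeatedly: each call gets its own process and reactor.
    return run_in_fresh_interpreter(CRAWL_SCRIPT, params)
```

Calling main({"make": "Audi"}) twice in a row then starts two separate interpreters instead of restarting one reactor.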

0 answers:

No answers yet