ReactorNotRestartable (twisted.internet.error.ReactorNotRestartable) raised in AWS Lambda
I deployed a web crawler to AWS Lambda. In testing, it ran correctly the first time, but the second invocation failed with this error:
  File "/var/task/main.py", line 19, in run_spider
    reactor.run()
  File "/var/task/twisted/internet/base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/var/task/twisted/internet/base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "/var/task/twisted/internet/base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
The crawler runs fine in my local Python environment. The function I am trying to run inside main.py is:
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor


def run_spider(event, s):
    given_links = []
    print(given_links)
    for t in event["Records"]:
        given_links.append(t["body"])
    runner = CrawlerRunner(s)
    deferred = runner.crawl('spider', crawl_links=given_links)
    deferred.addCallback(lambda _: reactor.stop())
    reactor.run()


def lambda_handler(event, context=None):
    s = get_project_settings()
    s['FEED_FORMAT'] = 'csv'
    s['FEED_URI'] = '/tmp/output.csv'
    run_spider(event, s)
The event looks like this:
{
  "Records": [
    {
      "body": "https://example.com"
    }
  ]
}
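Initially I used CrawlerProcess instead of CrawlerRunner, but it gave the same error. After reading some answers on StackOverflow, I changed the code to use CrawlerRunner. Someone also suggested using crochet, which I tried, and I got this error: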
ValueError: signal only works in main thread in scrapy
What should I do to resolve this error?
Answer 0 (score: 0)
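I ran into the ReactorNotRestartable error on AWS Lambda, and it went away after I used this solution.
By default, Scrapy's asynchronous nature does not work well with Cloud Functions: we need a way to block on the crawl so that the function does not return early and the instance is not killed before the process terminates.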
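As background on why the first invocation works and the second one fails: Twisted's reactor is a process-wide object that cannot be run again once it has been stopped, and Lambda may reuse the same process for later invocations. The snippet below is only a toy model of that situation, not Twisted itself; OneShotLoop and handler are hypothetical names made up for illustration.

```python
class OneShotLoop:
    """Toy stand-in for a process-wide event loop that can run only once."""

    def __init__(self):
        self._has_run = False

    def run(self):
        if self._has_run:
            raise RuntimeError("loop is not restartable")
        self._has_run = True


# Module-level object: created once per process, like a global reactor.
loop = OneShotLoop()


def handler(event, context=None):
    """Hypothetical handler that runs the shared loop on every call."""
    loop.run()
    return "ok"


# Two calls in one process model two invocations of a reused (warm) container.
first = handler({})
try:
    second = handler({})
except RuntimeError as exc:
    second = f"failed: {exc}"

print(first, "|", second)
```

Running the same script locally starts a new process each time, which is consistent with the crawler working fine outside Lambda.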
We can use scrapydo to run your existing spider in a blocking fashion:
import scrapy
import scrapy.crawler as crawler
from scrapy.spiders import CrawlSpider
import scrapydo

# Initialize scrapydo once before running any spider.
scrapydo.setup()


# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# Blocks until the crawl has finished.
scrapydo.run_spider(QuotesSpider)
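If you would rather not add a dependency, another approach that is often suggested is to run every crawl in a fresh child process, so the reactor is always new. Below is a minimal sketch of that idea, not a tested Lambda deployment. It assumes a Linux runtime where the "fork" start method is available, reuses the spider name 'spider' and the feed settings from the question, and introduces crawl_in_child and run_in_child as hypothetical helper names. It uses Process and Pipe because multiprocessing.Queue and Pool are reported not to work on Lambda.

```python
import multiprocessing as mp


def crawl_in_child(links, conn):
    """Runs in the child process, so Scrapy and Twisted start from scratch."""
    try:
        # Imported here so the parent process never touches the reactor.
        from scrapy.crawler import CrawlerProcess
        from scrapy.utils.project import get_project_settings

        settings = get_project_settings()
        settings['FEED_FORMAT'] = 'csv'           # as in the question
        settings['FEED_URI'] = '/tmp/output.csv'  # as in the question
        process = CrawlerProcess(settings)
        process.crawl('spider', crawl_links=links)
        process.start()  # blocks until the crawl has finished
        conn.send(None)  # None means "no error"
    except Exception as exc:
        conn.send(repr(exc))
    finally:
        conn.close()


def run_in_child(target, *args):
    """Run target(*args, conn) in a forked child and wait for its report."""
    ctx = mp.get_context("fork")  # assumption: Linux, as on Lambda
    receiver, sender = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=target, args=(*args, sender))
    proc.start()
    sender.close()  # parent keeps only the receiving end
    try:
        error = receiver.recv()
    except EOFError:
        error = "child exited without reporting a result"
    proc.join()
    if error is not None:
        raise RuntimeError(error)


def lambda_handler(event, context=None):
    links = [record["body"] for record in event["Records"]]
    run_in_child(crawl_in_child, links)
```

Because the parent only forks and waits, calling lambda_handler repeatedly in the same container should not hit ReactorNotRestartable. The cost is one extra process start per invocation.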