I am running a spider in a Scrapy project from a script file, and the spider is logging the crawl output/results. But I want to use the spider's output/results in a function inside that script file. I don't want to save the output/results to any file or database. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())
d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()
def spider_output(output):
    # do something with that output
    pass
How can I get the spider's output/results inside the spider_output function? Is it possible to get the output/results this way?
Answer 0 (score: 6)
Here is a solution that collects all the output/results in a list:
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher

# import your own spider class here, e.g.:
# from my_project.spiders.my_spider import MySpider

def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_passed)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results

if __name__ == '__main__':
    print(spider_results())
Answer 1 (score: 1)
AFAIK there is no way to do this, since crawl():

Returns a deferred that is fired when the crawling is finished.

Crawlers do not store the results anywhere other than outputting them to the logger.
However, returning the output would conflict with the whole asynchronous nature and structure of Scrapy, so saving to a file and then reading it back is the preferred approach here.
You can simply devise an item pipeline that saves your items to a file, and then just read that file in spider_output. You will receive the results, since reactor.run() blocks your script until the output file is complete anyway.
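Here is a minimal sketch of that approach, not from the original answer: a pipeline that writes each item as one JSON line to a file, and a spider_output function that reads it back. The JsonWriterPipeline name, the results.jl file name, and the wiring are illustrative assumptions; the pipeline would also need to be enabled in the project's ITEM_PIPELINES setting.

import json

class JsonWriterPipeline:
    """Writes each scraped item as one JSON line to results.jl."""

    def open_spider(self, spider):
        self.file = open('results.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

def spider_output():
    # reactor.run() has already blocked until the crawl finished,
    # so results.jl is complete by the time this function runs
    with open('results.jl') as f:
        return [json.loads(line) for line in f]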
Answer 2 (score: 1)
This is an old question, but for future reference: if you are using Python 3.6+, I recommend scrapyscript, which lets you run your spiders and get the results in a super simple way:
from scrapyscript import Job, Processor
from scrapy.spiders import Spider
from scrapy import Request
import json

# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request(self.url)

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}

# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')

# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)

# Start the reactor, and block until all spiders complete.
data = processor.run([githubJob, pythonJob])

# Print the consolidated results
print(json.dumps(data, indent=4))
[
    {
        "title": [
            "Welcome to Python.org"
        ],
        "url": "https://www.python.org/"
    },
    {
        "title": [
            "The world's leading software development platform \u00b7 GitHub",
            "1clr-code-hosting"
        ],
        "url": "https://github.com/"
    }
]
Answer 3 (score: 0)
My suggestion is to use the Python subprocess module to run the spider from your script, rather than the method given in the Scrapy docs for running a spider from a Python script. The reason is that with the subprocess module you can capture the output/logs, and even print statements, from inside the spider.

In Python 3, execute the spider with the run method. E.g.:
import subprocess

process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if process.returncode == 0:
    result = process.stdout.decode('utf-8')
else:
    # check for errors using 'process.stderr'
    error = process.stderr.decode('utf-8')
Setting stdout/stderr to subprocess.PIPE allows the output to be captured, so it is important to set these flags.

Here command should be a sequence or a string (if it is a string, call the run method with one more argument: shell=True). For example:
command = ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
# or
command = 'scrapy crawl website -a customArg=blahblah' # with shell=True
# or
import shlex
command = shlex.split('scrapy crawl website -a customArg=blahblah') # without shell=True
Also, process.stdout will contain the output of the script, but it will be of type bytes. You need to convert it to str using decode('utf-8').
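As an illustrative sketch (an assumption on top of this answer, not part of it): if the spider prints each scraped item as a JSON line, for example print(json.dumps(dict(item))) in its parse method, the captured stdout can be turned back into Python objects like this:

import json
import shlex
import subprocess

# Hypothetical spider name and argument, matching the examples above
command = shlex.split('scrapy crawl website -a customArg=blahblah')
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

items = []
if process.returncode == 0:
    for line in process.stdout.decode('utf-8').splitlines():
        line = line.strip()
        # keep only lines that look like the JSON printed by the spider
        if line.startswith('{'):
            items.append(json.loads(line))
print(items)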
Answer 4 (score: 0)
This will return all of the spider's results in a list.
from scrapyscript import Job, Processor
from scrapy.utils.project import get_project_settings

def get_spider_output(spider, **kwargs):
    job = Job(spider, **kwargs)
    processor = Processor(settings=get_project_settings())
    return processor.run([job])
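A possible usage sketch, assuming a hypothetical MySpider class from your project that accepts a url argument:

# Hypothetical usage: MySpider and the url argument are placeholders
from my_project.spiders.my_spider import MySpider

results = get_spider_output(MySpider, url='http://www.example.com')
print(results)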