I am using Scrapy to fetch data, and I want to use the Flask web framework to display the results on a web page. But I don't know how to call my spiders from the Flask app. I tried using CrawlerProcess
to call my spider, but I got an error like this:
ValueError
ValueError: signal only works in main thread
Traceback (most recent call last)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
return self.wsgi_app(environ, start_response)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/Users/Rabbit/PycharmProjects/Flask_template/FlaskTemplate.py", line 102, in index
process = CrawlerProcess()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 210, in __init__
install_shutdown_handlers(self._signal_shutdown)
File "/Library/Python/2.7/site-packages/scrapy/utils/ossignal.py", line 21, in install_shutdown_handlers
reactor._handleSignals()
File "/Library/Python/2.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
_SignalReactorMixin._handleSignals(self)
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1154, in _handleSignals
signal.signal(signal.SIGINT, self.sigInt)
ValueError: signal only works in main thread
My Scrapy code is as follows:
class EPGD(Item):
    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()

class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]

    db = DB_Con()
    collection = db.getcollection(name, term)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            self.collection.update({"genID":item['genID']}, dict(item), upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')
        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"
My Flask code is as follows:
@app.route('/', methods=['GET', 'POST'])
def index():
    process = CrawlerProcess()
    process.crawl(EPGD_spider)
    return redirect(url_for('details'))

@app.route('/details', methods = ['GET'])
def epgd():
    if request.method == 'GET':
        results = db['EPGD_test'].find()
        json_results = []
        for result in results:
            json_results.append(result)
        return toJson(json_results)
How can I call my Scrapy spiders when using the Flask web framework?
Answer 0 (score: 22)
Adding an HTTP server in front of your spiders is not that easy. There are a couple of options.
If you are really limited to Flask and cannot use anything else, the only way to integrate Scrapy with Flask is to launch an external process for every spider crawl, as another answer recommends (note that your subprocess needs to be spawned in the proper Scrapy project directory).
The directory structure for all the examples should look like this (I am using the dirbot test project):
> tree -L 1
├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py
Here is a code example that launches Scrapy in a new process:
# server.py
import subprocess

from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    """
    Run spider in another process and store items in file. Simply issue command:

    > scrapy crawl dmoz -o "output.json"

    wait for this command to finish, and read output.json to client.
    """
    spider_name = "dmoz"
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()

if __name__ == '__main__':
    app.run(debug=True)
Save the above as server.py and visit localhost:5000; you should see the scraped items.
Another, better way is to use an existing project that integrates Twisted with Werkzeug and exposes an API similar to Flask, e.g. Twisted-Klein. Twisted-Klein lets you run your spiders asynchronously in the same process as your web server. It is better in that it does not block on every request, and it allows you to simply return a Scrapy/Twisted Deferred from the HTTP route request handler.
The following snippet integrates Twisted-Klein with Scrapy. Note that you need to create your own subclass of CrawlerRunner so that the crawler collects items and returns them to the caller. This option is more advanced: the Scrapy spider runs in the same process as the Python server, and items are not stored in a file but kept in memory (so there is no disk writing/reading as in the previous example). Most importantly, it is asynchronous and it all runs in one Twisted reactor.
# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted.Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    spider = DmozSpider()
    deferred = runner.crawl(spider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)
Save the above in a file server.py and place it in your Scrapy project directory; now open localhost:8080 and it will launch the dmoz spider and return the scraped items to the browser as JSON.
There are some problems that arise when you try to add an HTTP app in front of your spiders. For example, you sometimes need to handle spider logs (you may need them in some cases), you need to handle spider exceptions somehow, and so on. There are projects that allow you to add an HTTP API to your spiders in an easier way, for example ScrapyRT. This is an app that adds an HTTP server to your Scrapy spiders and handles all those problems for you (e.g. handling logging, handling spider errors, etc.).
So after installing ScrapyRT you only need to do:
> scrapyrt
in your Scrapy project directory, and it will launch an HTTP server listening for your requests. You then visit http://localhost:9080/crawl.json?spider_name=dmoz&url=http://alfa.com and it should launch your spider to crawl the given URL for you.
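If you still want a Flask front end on top of that, a minimal sketch of a view that simply proxies to the ScrapyRT endpoint above could look like the following. The port 9080 and the crawl.json query parameters come from the example above; the route name, the default values, and the use of the requests library are my own assumptions:

# flask_scrapyrt_proxy.py - hypothetical sketch, assumes ScrapyRT is already running on port 9080
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/crawl')
def crawl():
    # forward the spider name and start URL to the ScrapyRT HTTP API
    params = {
        'spider_name': request.args.get('spider_name', 'dmoz'),
        'url': request.args.get('url', 'http://alfa.com'),
    }
    resp = requests.get('http://localhost:9080/crawl.json', params=params)
    # ScrapyRT answers with a JSON document describing the crawl (items, stats, ...)
    return jsonify(resp.json())

if __name__ == '__main__':
    app.run(port=5000)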
Disclaimer: I am one of the authors of ScrapyRT.
Answer 1 (score: 1)
This only works when you use the crawler in a self-contained way.
How about using the subprocess module and subprocess.call()?
I changed your spider as follows and it worked. I do not have the database setup, so those lines have been commented out.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.selector import Selector
from scrapy import Request


class EPGD(scrapy.Item):
    genID = scrapy.Field()
    genID_url = scrapy.Field()
    taxID = scrapy.Field()
    taxID_url = scrapy.Field()
    familyID = scrapy.Field()
    familyID_url = scrapy.Field()
    chromosome = scrapy.Field()
    symbol = scrapy.Field()
    description = scrapy.Field()


class EPGD_spider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            #self.collection.update({"genID":item['genID']}, dict(item), upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')
        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"


process = CrawlerProcess()
process.crawl(EPGD_spider)
process.start()
You should then be able to run the above with:
subprocess.check_output(['scrapy', 'runspider', "epgd.py"])
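To tie that back into Flask, a minimal sketch of a view that wraps the same subprocess call and serves the results could look like the following. It assumes the process = CrawlerProcess() block at the bottom of epgd.py is removed (or guarded behind if __name__ == '__main__':) so that scrapy runspider alone drives the crawl, and that items are dumped to an output.json file via the -o option; the route name and file names are my own choices:

# hypothetical Flask wrapper around the subprocess call above
import os
import subprocess

from flask import Flask

app = Flask(__name__)

@app.route('/crawl')
def crawl():
    # Scrapy's -o option appends to an existing file, so start from a clean one
    if os.path.exists('output.json'):
        os.remove('output.json')
    # run the spider file in a separate process and dump the scraped items to JSON
    subprocess.check_output(['scrapy', 'runspider', 'epgd.py', '-o', 'output.json'])
    # read the items back and return them to the browser
    with open('output.json') as items_file:
        return items_file.read()

if __name__ == '__main__':
    app.run(debug=True)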
Answer 2 (score: 1)
The problem is that the reactor cannot be restarted. There are roughly 3 options: a. CrawlerProcess, b. CrawlerRunner, c. subprocess. We can use CrawlerRunner and subprocess, but we have to manually control how to start/stop the reactor.
I use Flask's @app.before_first_request to inject logic before any request and start the reactor there:
from threading import Thread
from twisted.internet import reactor

@app.before_first_request
def activate_job():
    def run_job():
        #time.sleep(0.5)
        try:
            if not reactor.running:
                # you may need reactor.run(installSignalHandlers=False) here,
                # since signal handlers can only be installed in the main thread
                reactor.run()
        except:
            pass

    # run the reactor in a background thread so Flask's own thread is not blocked
    thread = Thread(target=run_job)
    thread.start()
Then, if you want to use subprocess:
# how to pass parameters: https://stackoverflow.com/questions/15611605/how-to-pass-a-user-defined-argument-in-scrapy-spider
def crawl_by_process(self):
    crawlSettings = {}
    subprocess.check_output(['scrapy', 'crawl', "demoSpider", '-a', 'cs='+json.dumps(crawlSettings)])
Or, if you want to use CrawlerProcess:
# async, will return immediately and won't wait for the crawl to finish
# assumes: import inspect, json; a project `settings` module; and
#   from scrapy.utils.log import configure_logging
#   from scrapy.utils.project import get_project_settings
#   from scrapy.crawler import CrawlerRunner
def crawl(self):
    crawlSettings = {}
    configure_logging()
    s = get_project_settings()
    for a in inspect.getmembers(settings):
        if not a[0].startswith('_'):
            # ignores methods
            if not inspect.ismethod(a[1]):
                s.update({a[0]: a[1]})

    # use CrawlerRunner when you want to integrate Scrapy into an existing Twisted application
    runner = CrawlerRunner(s)
    d = runner.crawl(demoSpider.DemoSpider, crawlSettings)
    d.addCallback(return_spider_output)
    return d


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])
Here is my blog post explaining the logic above: https://dingyuliang.me/scrapy-how-to-build-scrapy-with-flask-rest-api-2/
Answer 3 (score: 0)
There is at least one more way to do it that hasn't been covered yet, namely using the crochet library. To demonstrate, we create a minimal Flask app that returns JSON output, together with a modified version of a basic example spider.
import crochet
crochet.setup()  # initialize crochet before further imports

from flask import Flask, jsonify
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher

from myproject.spiders import example

app = Flask(__name__)
output_data = []
crawl_runner = CrawlerRunner()
# crawl_runner = CrawlerRunner(get_project_settings()) if you want to apply settings.py


@app.route("/scrape")
def scrape():
    # run crawler in twisted reactor synchronously
    scrape_with_crochet()

    return jsonify(output_data)


@crochet.wait_for(timeout=60.0)
def scrape_with_crochet():
    # signal fires when single item is processed
    # and calls _crawler_result to append that item
    dispatcher.connect(_crawler_result, signal=signals.item_scraped)
    eventual = crawl_runner.crawl(
        example.ToScrapeSpiderXPath)
    return eventual  # returns a twisted.internet.defer.Deferred


def _crawler_result(item, response, spider):
    """
    We're using dict() to decode the items.
    Ideally this should be done using a proper export pipeline.
    """
    output_data.append(dict(item))


if __name__ == '__main__':
    app.run('0.0.0.0', 8080)
# example.py - the spider module imported above as myproject.spiders.example
import scrapy


class MyItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield MyItem(
                text=quote.xpath('./span[@class="text"]/text()').extract_first(),
                author=quote.xpath('.//small[@class="author"]/text()').extract_first())

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
This whole setup is done in a synchronous way, meaning /scrape will not return anything until the crawl process is finished. Here is some additional information from the crochet documentation:
Setup: Crochet does a number of things for you as part of setup. Most significantly, it runs Twisted's reactor in a thread it manages.
@wait_for: Blocking calls into Twisted (...) When the decorated function is called, the code does not run in the calling thread, but rather in the reactor thread.
The function blocks until a result is available from the code running in the Twisted thread.
This solution was inspired by the following 2 posts:
Execute Scrapy spiders in a Flask web application
Get Scrapy crawler output/results in script file function
Note that this is a very prototype-like approach; for example, output_data will keep its state after a request. If you are just looking for a way to get started, this is fine.
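If that lingering state bothers you, a minimal tweak (my own sketch, not part of the original answer) is to clear output_data at the top of the scrape view shown above, keeping the synchronous, single-worker assumption of the example:

# sketch: replace the scrape view above so earlier results don't leak into new responses
@app.route("/scrape")
def scrape():
    del output_data[:]          # clear items collected by previous requests
    scrape_with_crochet()       # blocks until the crawl finishes (see @crochet.wait_for above)
    return jsonify(output_data)
# note: this is still not safe if several /scrape requests run concurrently;
# for that you would collect items per request instead of in a module-level list.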