I want to accept multiple concurrent requests to a Flask API. The API currently takes a "company name" via a POST
method and calls the crawler engine, and each crawl process takes 5-10 minutes to complete. I want to run many crawler engines in parallel to serve different requests. I followed the Go Playground, but couldn't get it to work. Currently, the second request is cancelling the first one. How can I achieve this kind of parallelism?
Current API implementation:
app.py
from flask import Flask, request, jsonify, abort
import subprocess

# SeedListGenerator, RunAllScrapper and CompanyNameSchema come from the
# project's own crawler modules (their imports are omitted here).

app = Flask(__name__)
app.debug = True

@app.route("/api/v1/crawl", methods=['POST'])
def crawl_end_point():
    if not request.is_json:
        abort(415)

    inputs = CompanyNameSchema(request)
    if not inputs.validate():
        return jsonify(success=False, errors=inputs.errors)

    data = request.get_json()
    company_name = data.get("company_name")
    print(company_name)

    if company_name is not None:
        # These steps make up the 5-10 minute crawl
        search = SeedListGenerator(company_name)
        search.start_crawler()

        scrap = RunAllScrapper(company_name)
        scrap.start_all()

        subprocess.call(['/bin/bash', '-i', '-c', 'myconda;scrapy crawl company_profiler;'])
    return 'Data Pushed successfully to Solr Index!', 201

if __name__ == "__main__":
    app.run(host="10.250.36.52", use_reloader=True, threaded=True)
gunicorn.sh
#!/bin/bash
NAME="Crawler-API"
FLASKDIR=/root/Public/company_profiler
SOCKFILE=/root/Public/company_profiler/sock
LOG=./logs/gunicorn/gunicorn.log
PID=./gunicorn.pid
USER=root
GROUP=root
NUM_WORKERS=10 # generally in the 2-4 x $(NUM_CORES)+1 range
TIMEOUT=1200
#preload_apps = False
# The maximum number of requests a worker will process before restarting.
MAX_REQUESTS=0
echo "Starting $NAME"
# Create the run directory if it doesn't exist
RUNDIR=$(dirname $SOCKFILE)
test -d $RUNDIR || mkdir -p $RUNDIR
# Start your gunicorn
exec gunicorn app:app -b 0.0.0.0:5000 \
--name $NAME \
--worker-class gevent \
--workers 5 \
--keep-alive 900 \
--graceful-timeout 1200 \
--worker-connections 5 \
--user=$USER --group=$GROUP \
--bind=unix:$SOCKFILE \
--log-level info \
--backlog 0 \
--pid=$PID \
--access-logformat='%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"' \
--error-logfile $LOG \
--log-file=-
Thanks in advance!
Answer 0 (score: 2)
A better approach is to use a job queue backed by Redis, or something similar. You can create a queue of jobs from the API requests, fetch the results, and organize the exchange with the frontend. Each job then runs in a separate process without getting in the way of the main application. Otherwise you will have to resolve bottlenecks at every step.
A good implementation is the RQ lib for Redis, or flask-rq. The worker (e.g. worker.py), which runs in its own process:
import os

import redis
from rq import Worker, Queue, Connection

listen = ['high', 'default', 'low']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')
conn = redis.from_url(redis_url)

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(map(Queue, listen))
        worker.work()
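Run one or more of these workers in separate processes from the Flask app (for example "python worker.py", or the "rq worker" command); each additional worker gives you one more crawl that can run in parallel.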
In the Flask application, enqueue the long-running function and track the job by its id:

from flask import session
from redis import Redis
from rq import Queue
from rq.job import Job

conn = Redis()
q = Queue(connection=conn)

def crawl_end_point():
    ...

# adding task to queue
result = q.enqueue(crawl_end_point, timeout=3600)

# simplest way to save the id of the job
session['j_id'] = result.get_id()

# get job status
job = Job.fetch(session['j_id'], connection=conn)
job.get_status()

# get job results
job.result
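Applied to the original question, a minimal sketch (not the original code) of how the /api/v1/crawl endpoint could enqueue the crawl instead of running it inline; crawl_company and the status route are illustrative names, and validation is omitted:

from flask import Flask, request, jsonify
from redis import Redis
from rq import Queue
from rq.job import Job

app = Flask(__name__)
conn = Redis()
q = Queue(connection=conn)

def crawl_company(company_name):
    # The long-running work (SeedListGenerator, RunAllScrapper, the scrapy call)
    # would go here. It must live in a module the RQ worker can import.
    ...

@app.route("/api/v1/crawl", methods=['POST'])
def crawl_end_point():
    company_name = request.get_json().get("company_name")
    # Hand the work to an RQ worker and return immediately.
    job = q.enqueue(crawl_company, company_name, job_timeout=3600)  # "timeout=" in older RQ versions
    return jsonify(job_id=job.get_id()), 202

@app.route("/api/v1/crawl/<job_id>", methods=['GET'])
def crawl_status(job_id):
    job = Job.fetch(job_id, connection=conn)
    return jsonify(status=job.get_status(), result=job.result)

With this layout, ten concurrent POSTs simply enqueue ten jobs, and the number of crawls running at once is controlled by how many RQ workers you start, not by gunicorn workers.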
You can also check Celery for this purpose: https://stackshare.io/stackups/celery-vs-redis
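If Celery is preferred, a roughly equivalent sketch, assuming a Redis broker at redis://localhost:6379/0 (crawl_company is the same illustrative placeholder):

from celery import Celery

celery_app = Celery('crawler',
                    broker='redis://localhost:6379/0',
                    backend='redis://localhost:6379/0')

@celery_app.task
def crawl_company(company_name):
    # Long-running crawl work goes here.
    ...

# In the Flask view: fire the task and return its id right away.
# async_result = crawl_company.delay(company_name)
# The frontend can later poll async_result.status or call async_result.get().

Start the workers with the celery CLI, e.g. "celery -A <module_name> worker"; the worker's concurrency setting then determines how many crawls run in parallel.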