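I'm setting up a web-based scraper on Heroku using Flask, Celery, and PhantomJS (driven through Selenium). The scraper runs fine on my local machine (apart from a warning that PhantomJS is deprecated), but on Heroku Celery freezes as soon as I start a scrape.
I installed the web driver through the buildpack https://github.com/stomita/heroku-buildpack-phantomjs. I also tried using the Chrome driver with Selenium, but the same thing happens: Celery freezes (judging by the logs, the process never moves forward).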
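One way to narrow this down is to confirm that the worker process can actually see the driver binary the buildpack installed. The sketch below is only a diagnostic. It assumes the binaries are named `phantomjs` and `chromedriver` and are expected on `PATH`; the helper name `find_driver_binaries` is hypothetical and not part of the original code.

```python
import shutil


def find_driver_binaries(names=("phantomjs", "chromedriver")):
    """Map each binary name to its resolved path, or None if it is not on PATH.

    The default names are assumptions about what the buildpacks install.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_driver_binaries().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a name comes back as `None` when this is called from inside the Celery task, the worker's environment probably differs from the one where the driver was installed, which would be worth ruling out before looking at Celery itself.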
Some code from my server.py:
import os

from celery import Celery
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)
app.config['CELERY_BROKER_URL'] = os.environ['REDIS_URL']
app.config['CELERY_RESULT_BACKEND'] = os.environ['REDIS_URL']

celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)

@celery.task
def task_scrape(url):
    # do_scrape is defined elsewhere (its import is not shown in this excerpt)
    return do_scrape(url, standalone=False)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/scrape', methods=['POST'])
def scrape():
    # Queue the scrape on the Celery worker and return the task id to the client
    url = request.get_json()['url']
    task = task_scrape.delay(url)
    return jsonify({'taskid': task.id})
Some code from my app.py (which does the actual scraping):
import time

from selenium import webdriver

def render_type2_page(url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    time.sleep(3)  # give the page's JavaScript time to render
    r = driver.page_source
    driver.quit()
    return r
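Because `driver.get()` has no explicit timeout here and `driver.quit()` is skipped if anything raises, a stuck page load could leave the worker waiting with a browser process still alive. The following is a minimal sketch of a more defensive variant, not a confirmed fix for the freeze. The function name `render_page`, the `make_driver` parameter, and the 30-second default are assumptions introduced for illustration; `set_page_load_timeout` is the standard Selenium WebDriver method.

```python
import time


def render_page(url, make_driver, wait_seconds=3, page_load_timeout=30):
    """Load url with a driver from make_driver() and return the page source.

    make_driver is any zero-argument callable returning a Selenium-style
    driver, e.g. webdriver.PhantomJS (hypothetical wiring, see usage below).
    """
    driver = make_driver()
    try:
        # Fail with an exception instead of waiting indefinitely on a slow page
        driver.set_page_load_timeout(page_load_timeout)
        driver.get(url)
        time.sleep(wait_seconds)  # give the page's JavaScript time to render
        return driver.page_source
    finally:
        # Always shut the browser down, even when get() raises
        driver.quit()


# Usage with Selenium installed (not executed here):
#   from selenium import webdriver
#   html = render_page("https://example.com", webdriver.PhantomJS)
```

Passing the driver factory in as an argument keeps the function independent of a specific browser, so switching between PhantomJS and Chrome only changes the call site.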
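The expected result is that the logs show the Celery worker writing the scraped data to memory, where it will shortly be available for download as a CSV. Like this: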
[2019-07-03 01:13:36,036: WARNING/ForkPoolWorker-4] [*] Writing item # 2354187
[2019-07-03 01:13:37,269: WARNING/ForkPoolWorker-4] [*] Writing item # 2410452
[2019-07-03 01:13:38,505: WARNING/ForkPoolWorker-4] [*] Writing item # 2307212
[2019-07-03 01:13:39,844: WARNING/ForkPoolWorker-4] [*] Writing item # 2307709
[2019-07-03 01:13:41,055: WARNING/ForkPoolWorker-4] [*] Writing item # 2330733
[2019-07-03 01:13:42,283: WARNING/ForkPoolWorker-4] [*] Writing item # 2400294
[2019-07-03 01:13:43,501: WARNING/ForkPoolWorker-4] [*] Writing item # 2277081
[2019-07-03 01:13:44,729: WARNING/ForkPoolWorker-4] [*] Writing item # 2306055
[2019-07-03 01:13:45,991: WARNING/ForkPoolWorker-4] [*] Writing item # 2329127
[2019-07-03 01:13:47,312: WARNING/ForkPoolWorker-4] [*] Writing item # 2390199
[2019-07-03 01:13:48,545: WARNING/ForkPoolWorker-4] [*] Writing item # 2400295
[2019-07-03 01:13:49,797: WARNING/ForkPoolWorker-4] [*] Writing item # 2328693
But the actual result is that Selenium times out and the Celery worker does no work at all.