I'm building a site that relies on the output of a machine learning algorithm. All that is needed for the user-facing part of the site is the output of the algorithm (class labels for a set of items), which can be easily stored and retrieved from the django models. The algorithm could be run once a day, and does not rely on user input.
So this part of the site only depends on django and related packages.
But developing, tuning, and evaluating the algorithm uses many other python packages such as scikit-learn, pandas, numpy, matplotlib, etc. It also requires saving many different sets of class labels.
These dependencies cause some issues when deploying to heroku, because numpy requires LAPACK/BLAS.
It also seems like it would be good practice to have as few dependencies as possible in the deployed app.
How can I separate the machine-learning part from the user-facing part, but still have them integrated enough that the results of the algorithm are easily used?
I thought of creating two separate projects, and then writing to the user-facing database in some way, but that seems like it would lead to maintenance problems (managing the dependencies, changes in database schemas, etc.).
As far as I understand, this problem is a little different from using different settings or databases for production and development, because it is more about managing different sets of dependencies.
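Answer 0 (score: 3)
Moving what we discussed into an answer in case other people have the same question, my suggestions are:
Spend some time defining the dependencies of your website and of your algorithm code.
Dump the dependency list into a requirements.txt for each project.
Deploy them in separate environments so that they don't conflict.
Use Django Rest Framework or Tastypie to develop some API endpoints on the website side, and have your algorithm code update your models through the API. Use cron to run the algorithm code periodically and push the data.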
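The last suggestion could look roughly like the sketch below on the algorithm side. It uses only the standard library, so the script adds no dependencies of its own. The endpoint URL, the token-style Authorization header, and the item_id/label field names are all assumptions made for illustration; the real ones depend on how the API endpoints on the site are defined.

```python
import json
import urllib.request


def build_label_payload(labels):
    """Turn {item_id: class_label} into a JSON-serializable list of records."""
    # The "item_id" and "label" field names are hypothetical.
    return [
        {"item_id": item_id, "label": label}
        for item_id, label in sorted(labels.items())
    ]


def push_labels(url, labels, token):
    """POST the labels as JSON to the site's API and return the HTTP status."""
    body = json.dumps(build_label_payload(labels)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Hypothetical token auth; use whatever scheme the API expects.
            "Authorization": "Token " + token,
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A daily cron job would then call something like push_labels("https://example.com/api/labels/", labels, token), where labels is the output of the algorithm and the URL is a placeholder. Because the site only ever sees the API, its requirements stay free of numpy and the other machine-learning packages.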
Answer 1 (score: 1)
Create a requirements file for each environment, plus a base requirements file for the packages shared by all environments.
$ mkdir requirements
$ pip freeze > requirements/base.txt
$ echo "-r base.txt" > requirements/development.txt
$ echo "-r base.txt" > requirements/production.txt
Then adjust the development and production dependencies and install each set in the appropriate environment:
# change to your development virtualenv
# $ source .virtualenvs/development/bin/activate
$ pip install -r requirements/development.txt
# change to your production virtualenv
# $ source .virtualenvs/production/bin/activate
$ pip install -r requirements/production.txt
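One way to keep the split honest over time is a small check that the production file never lists the machine-learning packages directly. The sketch below is only an illustration: the ML_ONLY set is a hypothetical deny-list built from the packages named in the question, the parsing handles only simple requirement lines, and it does not follow -r includes, so base.txt would need to be checked separately.

```python
import re

# Hypothetical deny-list: packages only the algorithm side should need.
ML_ONLY = {"scikit-learn", "pandas", "numpy", "matplotlib"}


def listed_packages(requirements_text):
    """Return the lower-cased package names listed in a requirements file."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and -r / -c options
            continue
        # The name is everything before the first version specifier or extra.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(match.group(0).lower())
    return names


def ml_packages_in(requirements_text):
    """Return the ML-only packages that a requirements file lists directly."""
    return sorted(listed_packages(requirements_text) & ML_ONLY)
```

Reading requirements/production.txt and passing its contents to ml_packages_in should give an empty list; anything else means a development-only package has leaked into the deployed app.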
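Answer 2 (score: 1)
pip-tools is the right tool for this. I had the same problem and solved it in this simple way. The machine learning libraries go in the dev-requirements.in file quoted below.
From the site:
If you have different environments that you need to install different but compatible packages for, then you can create layered requirements files and use one layer to constrain the other.
For example, if you have a Django project where you want the newest 2.1 release in production and when developing you want to use the Django debug toolbar, then you can create two *.in files, one for each layer: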
# requirements.in
django<2.2
At the top of the development requirements dev-requirements.in you use -c requirements.txt to constrain the dev requirements to packages already selected for production in requirements.txt.
# dev-requirements.in
-c requirements.txt
django-debug-toolbar
First, compile requirements.txt as usual:
$ pip-compile
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile
#
django==2.1.15
    # via -r requirements.in
pytz==2019.3
    # via django
Now compile the dev requirements, with the requirements.txt file used as a constraint:
$ pip-compile dev-requirements.in
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile dev-requirements.in
#
django-debug-toolbar==2.2
    # via -r dev-requirements.in
django==2.1.15
    # via
    #   -c requirements.txt
    #   django-debug-toolbar
pytz==2019.3
    # via
    #   -c requirements.txt
    #   django
sqlparse==0.3.0
    # via django-debug-toolbar
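As you can see above, even though a 2.2 release of Django is available, the dev requirements only include a 2.1 release of Django because they were constrained. Now both compiled requirements files can be installed safely in the dev environment.
To install requirements in the production stage, use: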
$ pip-sync
You can install requirements in the development stage with:
$ pip-sync requirements.txt dev-requirements.txt