I have a Django app deployed on Heroku. Its purpose is to let trusted, known internal users upload a CSV file, click "Run", and then, behind the scenes, have the Django app use a .pkl model (120 MB in size, let's say) to predict on the uploaded data.
This works for small CSV files, but if a user uploads a large one, the worker crashes with Memory quota vastly exceeded, and larger CSV files only increase memory consumption.
I'm not sure where to tune this. Has anyone experienced something similar when deploying an sklearn model, and if so, how did they "solve" it?
My first thought was to set DEBUG to False (with DEBUG=True, Django keeps a record of every SQL query in memory, which only grows over time). My Django models.py looks like this:
from django.db import models
from django.urls import reverse

class MLModel(models.Model):
    name = models.CharField(max_length=80)
    file = models.FileField(upload_to="models/")
    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)

    def __str__(self):
        return self.name

class Upload(models.Model):
    name = models.CharField(max_length=100)
    mlmodel = models.ForeignKey(MLModel, on_delete=models.CASCADE)
    file = models.FileField(upload_to='data/')

    def __str__(self):
        return self.name

    def get_absolute_url(self):
        return reverse('edit', kwargs={'pk': self.pk})
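For context, the "Run" flow hands the heavy lifting off to a Celery worker. A view along these lines could save the upload and enqueue the task by primary key; this is a hypothetical sketch (the view name, form, and template are my assumptions, not from the original app), and only piparoo comes from the actual task shown below:

# views.py -- hypothetical wiring, assuming an UploadForm ModelForm over Upload exists
from django.shortcuts import redirect, render
from .forms import UploadForm
from .tasks import piparoo

def run_prediction(request):
    if request.method == 'POST':
        form = UploadForm(request.POST, request.FILES)
        if form.is_valid():
            upload = form.save()
            piparoo.delay(upload.pk)  # run prediction on the worker dyno, not the web dyno
            return redirect(upload.get_absolute_url())
    else:
        form = UploadForm()
    return render(request, 'predict.html', {'form': form})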
My Celery task looks like this:
from io import StringIO
import joblib
import pandas as pd
from celery import shared_task
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from .models import Upload

@shared_task
def piparoo(id):
    instance = Upload.objects.get(id=id)
    # load the pickled model (~120 MB) from storage
    model = joblib.load(instance.mlmodel.file.storage.open(instance.mlmodel.file.name))
    data = pd.read_csv(instance.file)  # reads the entire CSV into memory at once
    data['Predicted'] = model.predict(data)
    buffer = StringIO()
    data.to_csv(buffer, index=False)
    content = buffer.getvalue().encode('utf-8')
    default_storage.save('output/results_{}.csv'.format(id), ContentFile(content))
Heroku logs:
2018-04-12T06:12:53.592922+00:00 app[worker.1]: [2018-04-12 06:12:53,592: INFO/MainProcess] Received task: predictions.tasks.piparoo[f1ca09e1-6bba-4115-8989-04bb32d4f08e]
2018-04-12T06:12:53.737378+00:00 heroku[router]: at=info method=GET path="/predict/" host=tdmpredict.herokuapp.com request_id=ffad9785-5cb6-4e3c-a87c-94cbca47d109 fwd="24.16.35.31" dyno=web.1 connect=0ms service=33ms status=200 bytes=6347 protocol=https
2018-04-12T06:13:08.054486+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2018-04-12T06:13:08.054399+00:00 heroku[worker.1]: Process running mem=572M(111.9%)
2018-04-12T06:13:28.026973+00:00 heroku[worker.1]: Error R15 (Memory quota vastly exceeded)
2018-04-12T06:13:28.026765+00:00 heroku[worker.1]: Process running mem=1075M(210.1%)
2018-04-12T06:13:28.026973+00:00 heroku[worker.1]: Stopping process with SIGKILL
2018-04-12T06:13:28.187650+00:00 heroku[worker.1]: Process exited with status 137
2018-04-12T06:13:28.306221+00:00 heroku[worker.1]: State changed from up to crashed
Answer 0 (score: 0)
The solution that fixed my problem (in a common-sense way):
Instead of reading the user's CSV file into memory all at once, I process it in chunks using pandas' chunksize parameter and concatenate the list of dataframes into one at the end. I also delete the model (120 MB) once prediction is done, in an attempt to free memory for future processes.
My Celery task now looks like this:
from io import StringIO
import joblib
import pandas as pd
from celery import shared_task
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from .models import Upload

@shared_task
def piparoo(id):
    instance = Upload.objects.get(id=id)
    model = joblib.load(instance.mlmodel.file.storage.open(instance.mlmodel.file.name))
    final = []
    # predict 5,000 rows at a time instead of loading the whole CSV
    for chunk in pd.read_csv(instance.file, chunksize=5000):
        chunk['Predicted'] = model.predict(chunk)
        final.append(chunk)
    del model  # release the ~120 MB model before concatenating results
    final = pd.concat(final)
    buffer = StringIO()
    final.to_csv(buffer, index=False)
    content = buffer.getvalue().encode('utf-8')
    default_storage.save('output/results_{}.csv'.format(id), ContentFile(content))
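Note that final still holds every predicted chunk in memory before the concat, so the peak footprint still grows with the file. If that becomes a problem again, one could stream each chunk straight into the CSV buffer as it is produced, writing the header only on the first chunk, so no full DataFrame of results is ever built. A minimal sketch of that variant (my own suggestion, not part of the original answer, reusing the same models and task name):

from io import StringIO
import joblib
import pandas as pd
from celery import shared_task
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from .models import Upload

@shared_task
def piparoo(id):
    instance = Upload.objects.get(id=id)
    model = joblib.load(instance.mlmodel.file.storage.open(instance.mlmodel.file.name))
    buffer = StringIO()
    for i, chunk in enumerate(pd.read_csv(instance.file, chunksize=5000)):
        chunk['Predicted'] = model.predict(chunk)
        # successive to_csv calls append to the buffer; emit the header only once
        chunk.to_csv(buffer, index=False, header=(i == 0))
    del model
    default_storage.save('output/results_{}.csv'.format(id),
                         ContentFile(buffer.getvalue().encode('utf-8')))

The output string in buffer still grows with the input, but this avoids holding the results twice (once as a list of DataFrames and again as the concatenated frame), which lowers the peak memory of the worker.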