I have some Django code that iterates over a model's queryset and drops any article that matches an existing one. The queryset has grown large, and these operations actually run as a periodic task, so speed is becoming a problem.
Here is the code, in case anyone is willing to try to help optimize it!
# For the below code, "articles" are just Django models
all_articles = [a really large list of articles]
newest_articles = [some large list of new articles]
unique_articles = []
for new_article in newest_articles:
    failed = False
    for old_article in all_articles:
        # is_similar is just a method which checks whether two strings are
        # identical to a certain degree
        if (is_similar(new_article.blurb, old_article.blurb, 0.9)
                and is_similar(new_article.title, old_article.title, 0.92)):
            failed = True
            break
    if not failed:
        unique_articles.append(new_article)
return unique_articles
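For reference, `is_similar` isn't shown above; here is a minimal sketch of what it might look like, assuming it simply wraps `difflib.SequenceMatcher` and compares the similarity ratio against a threshold (this is an assumption, not the actual implementation):

```python
from difflib import SequenceMatcher

def is_similar(a, b, threshold):
    # Hypothetical stand-in for the real is_similar: returns True when the
    # two strings match to at least `threshold` on a 0..1 scale.
    return SequenceMatcher(None, a, b).ratio() >= threshold
```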
Thanks, everyone!
Answer 0 (score: 1)
There doesn't seem to be any efficient way to implement a "fuzzy DISTINCT" at the SQL level, so I would suggest the precomputation route.
I'm trying to guess your business logic from a small code snippet, so this may be off base, but it sounds like you just need to know, for each new article, whether it has older dupes (as defined by the is_similar function). In that case, a viable approach could be to add an is_duplicate field to the Article model and recompute it in a background job whenever an article is saved. E.g. (using Celery):
from celery import task
from django.db.models import signals

@task
def recompute_similarity(article_id):
    article = Article.objects.get(id=article_id)
    article.is_duplicate = False
    for other in Article.objects.exclude(id=article_id):
        if (is_similar(article.title, other.title, 0.92)
                or is_similar(article.blurb, other.blurb, 0.9)):
            article.is_duplicate = True
            break
    article.save()

def on_article_save(sender, instance, created, raw, **kwargs):
    if not raw:
        recompute_similarity.delay(instance.id)

signals.post_save.connect(on_article_save, sender=Article)
Then your original routine would be reduced to just

Article.objects.filter(is_duplicate=False, ...recency condition)
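To illustrate the payoff: once `is_duplicate` is precomputed at save time, the periodic task becomes a cheap filter instead of the nested O(n·m) comparison loop. A minimal plain-Python sketch of the equivalent logic (no Django; `Article` here is a hypothetical stand-in for the model, not the real ORM class):

```python
from dataclasses import dataclass

@dataclass
class Article:
    # Hypothetical stand-in for the Django model, with the proposed flag.
    title: str
    is_duplicate: bool = False

def unique_articles(articles):
    # Equivalent of Article.objects.filter(is_duplicate=False): the expensive
    # similarity comparisons already happened in the background job.
    return [a for a in articles if not a.is_duplicate]

articles = [
    Article("First scoop"),
    Article("First scoop (repost)", is_duplicate=True),
    Article("Unrelated story"),
]
```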
Answer 1 (score: 1)