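Django Count and Sum annotations interfere with each other

Time: 2019-06-12 17:51:30

Tags: python django django-queryset

While constructing a complex QuerySet with several annotations, I ran into an issue that I could reproduce with the following simple setup.

Here are the models: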

from django.db import models

class Player(models.Model):
    name = models.CharField(max_length=200)

class Unit(models.Model):
    player = models.ForeignKey(Player, on_delete=models.CASCADE,
                               related_name='unit_set')
    rarity = models.IntegerField()

class Weapon(models.Model):
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE,
                             related_name='weapon_set')

With a test database, I get the following (correct) results:

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

[{'id': 1, 'name': 'James', 'weapon_count': 23},
 {'id': 2, 'name': 'Max', 'weapon_count': 41},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26}]


Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'rarity_sum': 42},
 {'id': 2, 'name': 'Max', 'rarity_sum': 89},
 {'id': 3, 'name': 'Bob', 'rarity_sum': 67}]

If I now combine both annotations in the same QuerySet, I get different (inaccurate) results:

Player.objects.annotate(
    weapon_count=Count('unit_set__weapon_set', distinct=True),
    rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99},
 {'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}]

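Notice how rarity_sum now has different values than before. Removing distinct=True does not affect the result. I also tried the DistinctSum function from this answer, in which case every rarity_sum is set to 18 (also inaccurate).

Why is this? How can I combine both annotations in the same QuerySet?

Edit: here is the sqlite query generated by the combined QuerySet: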

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

The data used for the results above is available here

4 answers:

Answer 0: (score: 4)

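This isn't a problem with the Django ORM; it is simply how relational databases work. When you construct simple querysets like

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

or

Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

the ORM does exactly what you expect it to do: it joins Player with Weapon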

SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" 
    ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" 
    ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

or Player with Unit

SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

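and performs either the COUNT or the SUM aggregation on them.

Note that although the first query has two joins between three tables, the intermediate table Unit appears neither among the columns referenced in SELECT nor in the GROUP BY clause. The only role Unit plays here is to join Player with Weapon.

Now if you look at your third queryset, things get more complicated. As in the first query, the joins are between three tables, but now Unit is referenced in SELECT, because there is a SUM aggregation over Unit.rarity: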

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

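This is the crucial difference between the second and the third query. In the second query, you join Player to Unit, so each Unit appears exactly once, in the row of the player it references.

In the third query, however, you join Player to Unit and then Unit to Weapon, so a Unit is listed not just once for the player it references, but once for every Weapon that references it. Its rarity is therefore added to the SUM once per weapon.

Let's take a look at a simple example: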

insert into sandbox_player values (1, 'player_1');

insert into sandbox_unit values(1, 10, 1);

insert into sandbox_weapon values (1, 1), (2, 1);

One player, one unit, and two weapons that reference the same unit.

Confirm that the problem exists:

>>> from sandbox.models import Player
>>> from django.db.models import Count, Sum

>>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]>

>>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]>


>>> Player.objects.annotate(
...     weapon_count=Count('unit_set__weapon_set', distinct=True),
...     rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]>

From this example it is easy to see the problem: in the combined query the unit is listed twice, once for each weapon that references it:

sqlite> SELECT "sandbox_player"."id",
   ...>        "sandbox_player"."name",
   ...>        "sandbox_weapon"."id",
   ...>        "sandbox_unit"."rarity"
   ...> FROM "sandbox_player"
   ...>          LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
   ...>          LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id");
id          name        id          rarity    
----------  ----------  ----------  ----------
1           player_1    1           10        
1           player_1    2           10   
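
The same effect can be reproduced outside Django. Below is a minimal sketch using Python's standard-library sqlite3 module; the table and column layout (id, rarity, player_id) is an assumption modelled on the INSERT statements above, and the data is made up for illustration. It also hints at why distinct=True rescues the count while a DISTINCT sum does not: two units that happen to share a rarity collapse into one value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema: a hand-written approximation of the Django tables.
    CREATE TABLE sandbox_player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sandbox_unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE sandbox_weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);

    INSERT INTO sandbox_player (id, name) VALUES (1, 'player_1');
    -- Unit 1 has three weapons, unit 2 has one, unit 3 has none.
    INSERT INTO sandbox_unit (id, rarity, player_id) VALUES (1, 10, 1), (2, 5, 1), (3, 5, 1);
    INSERT INTO sandbox_weapon (id, unit_id) VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

# The same two LEFT OUTER JOINs as in the combined query.
JOINS = """
    FROM sandbox_player p
    LEFT OUTER JOIN sandbox_unit u ON p.id = u.player_id
    LEFT OUTER JOIN sandbox_weapon w ON u.id = w.unit_id
"""

# Raw joined rows: one row per (unit, weapon) pair, plus one row for the weaponless unit.
joined_rows = conn.execute(
    "SELECT u.id, u.rarity, w.id" + JOINS + "ORDER BY u.id, w.id"
).fetchall()

# Aggregates computed over the fanned-out rows.
weapon_count, joined_sum, distinct_sum = conn.execute(
    "SELECT COUNT(DISTINCT w.id), SUM(u.rarity), SUM(DISTINCT u.rarity)"
    + JOINS
    + "GROUP BY p.id"
).fetchone()

# Reference value: the sum over the unit table alone.
true_sum = conn.execute(
    "SELECT SUM(rarity) FROM sandbox_unit WHERE player_id = 1"
).fetchone()[0]

print(len(joined_rows), weapon_count, joined_sum, distinct_sum, true_sum)
```

With this data, each unit's rarity is counted once per joined row: 3 × 10 + 1 × 5 + 1 × 5 = 40 instead of the true 20, while SUM(DISTINCT ...) gives 15 because the two units with rarity 5 are merged. COUNT(DISTINCT ...) still gives 4, since weapon ids are unique.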

What should you do?

As @ivissani mentioned, one of the easiest solutions is to write a subquery for each aggregation:

>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum
>>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk'))
>>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk'))
>>> qs = Player.objects.annotate(
...     weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()),
...     rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField())
... )
>>> qs.values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]>

which produces the following SQL

SELECT "sandbox_player"."id", "sandbox_player"."name", 
(
    SELECT COUNT(U2."id") AS "weapon_count"
    FROM "sandbox_player" U0 
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    LEFT OUTER JOIN "sandbox_weapon" U2 
        ON (U1."id" = U2."unit_id")
    WHERE U0."id" = ("sandbox_player"."id") 
    GROUP BY U0."id", U0."name"
) AS "weapon_count", 
(
    SELECT SUM(U1."rarity") AS "rarity_sum"
    FROM "sandbox_player" U0
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    WHERE U0."id" = ("sandbox_player"."id")
    GROUP BY U0."id", U0."name"
) AS "rarity_sum"
FROM "sandbox_player"
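
To check the idea at the SQL level without Django, here is a rough sketch with the standard-library sqlite3 module. The schema is an assumed approximation of the tables above, the second player is invented to show the empty case, and the correlated subqueries are a hand-simplified version of the generated SQL (they skip the redundant self-join on the player table), not what the ORM emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema, mirroring the simple example in this answer.
    CREATE TABLE sandbox_player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sandbox_unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE sandbox_weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);

    -- player_2 is hypothetical: it has no units at all.
    INSERT INTO sandbox_player (id, name) VALUES (1, 'player_1'), (2, 'player_2');
    INSERT INTO sandbox_unit (id, rarity, player_id) VALUES (1, 10, 1);
    INSERT INTO sandbox_weapon (id, unit_id) VALUES (1, 1), (2, 1);
""")

# Each aggregate lives in its own correlated subquery, so the weapon join
# can no longer multiply the unit rows that feed the SUM.
rows = conn.execute("""
    SELECT p.id,
           p.name,
           (SELECT COUNT(w.id)
              FROM sandbox_unit u
              JOIN sandbox_weapon w ON w.unit_id = u.id
             WHERE u.player_id = p.id) AS weapon_count,
           (SELECT SUM(u.rarity)
              FROM sandbox_unit u
             WHERE u.player_id = p.id) AS rarity_sum
      FROM sandbox_player p
     ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

For player_1 this gives a weapon count of 2 and a rarity sum of 10, matching the Subquery result above. Note that for a player without units the count is 0 but the sum is NULL (None in Python), so wrapping the sum in COALESCE may be worth considering if a numeric 0 is needed.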

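Answer 1: (score: 3)

A few notes to complement rktavi's excellent answer:

1) This issue has apparently been considered a bug for 10 years. It is even mentioned in the official documentation.

2) While converting my actual project's QuerySets to subqueries (as per rktavi's answer), I noticed that combining bare-bones annotations (for the distinct=True counts, which always worked correctly) with a Subquery (for the sums) leads to extremely long processing times (35 s vs. 100 ms) and to incorrect results for the sum. This holds in my actual setup (11 filtered counts on various nested relations and 1 filtered sum on a multiply-nested relation, on SQLite3) but cannot be reproduced with the simple models above. The issue can be tricky because another part of your code may add an annotation to your QuerySet (e.g. a Table.order_FOO() function).

3) With the same setup, I have anecdotal evidence that subquery-type QuerySets are faster than bare-bones annotation QuerySets (in cases where you only have distinct=True counts, of course). I could observe this both with local SQLite3 (83 ms vs. 260 ms) and hosted PostgreSQL (320 ms vs. 540 ms).

As a result of the above, I will completely avoid bare-bones annotations in favour of subqueries.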

Answer 2: (score: 1)

Based on @rktavi's excellent answer, I created two helper classes that simplify the Subquery/Count and Subquery/Sum patterns:

from django.db.models import PositiveIntegerField, Subquery


class SubqueryCount(Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = PositiveIntegerField()


class SubquerySum(Subquery):
    template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

    def __init__(self, queryset, column, output_field=None, **extra):
        if output_field is None:
            output_field = queryset.model._meta.get_field(column)
        super().__init__(queryset, output_field, column=column, **extra)
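
To get a feel for the SQL these two templates wrap around the inner query, here is a rough sketch that fills them in by hand and runs the result with the standard-library sqlite3 module. The schema, the sample data, and the inner SELECT statements (weapons_sql, units_sql) are assumptions standing in for whatever the ORM would generate from a queryset filtered with OuterRef; only the template strings are taken from the classes above.

```python
import sqlite3

# Template strings copied from SubqueryCount and SubquerySum.
COUNT_TEMPLATE = "(SELECT count(*) FROM (%(subquery)s) _count)"
SUM_TEMPLATE = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

# Hypothetical inner queries, correlated on the outer sandbox_player row.
weapons_sql = (
    "SELECT w.id FROM sandbox_weapon w "
    "JOIN sandbox_unit u ON w.unit_id = u.id "
    "WHERE u.player_id = sandbox_player.id"
)
units_sql = (
    "SELECT u.id, u.rarity FROM sandbox_unit u "
    "WHERE u.player_id = sandbox_player.id"
)

# Fill in the templates the way %-style string formatting would.
count_expr = COUNT_TEMPLATE % {"subquery": weapons_sql}
sum_expr = SUM_TEMPLATE % {"subquery": units_sql, "column": "rarity"}

query = (
    f"SELECT id, name, {count_expr} AS weapon_count, {sum_expr} AS rarity_sum "
    "FROM sandbox_player"
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema and made-up data for illustration.
    CREATE TABLE sandbox_player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sandbox_unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE sandbox_weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);

    INSERT INTO sandbox_player (id, name) VALUES (1, 'player_1');
    -- Unit 1 carries two weapons, unit 2 carries none.
    INSERT INTO sandbox_unit (id, rarity, player_id) VALUES (1, 10, 1), (2, 5, 1);
    INSERT INTO sandbox_weapon (id, unit_id) VALUES (1, 1), (2, 1);
""")

rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

Because each aggregate is computed over its own derived table, the weapon join never touches the rows that feed the sum: the sketch reports 2 weapons and a rarity sum of 15, where the single joined query would have produced 25.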

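Answer 3: (score: 1)

Thanks to @this for the brilliant answer!!

Here is my use case:

Using Django DRF.

I needed to get a Sum and a Count from different FKs inside the annotate, so that both would be part of one queryset and the fields could be added to ordering_fields in DRF.

The Sum and Count were clashing and returning wrong results. Your answer really helped me put it all together.

The annotate occasionally returned the dates as strings, so I needed to cast them to DateTimeField.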

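    from django.db.models import Count, DateTimeField, IntegerField, Max, OuterRef, Q, Subquery, Sum
    from django.db.models.functions import Cast

    donation_filter = Q(payments__status='donated') & ~Q(payments__payment_type__payment_type='coupon')
    total_donated_SQ = User.objects.annotate(total_donated=Sum('payments__sum', filter=donation_filter)).filter(pk=OuterRef('pk'))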
    message_count_SQ = User.objects.annotate(message_count=Count('events__id', filter=Q(events__event_id=6))).filter(pk=OuterRef('pk'))
    queryset = User.objects.annotate(
        total_donated=Subquery(total_donated_SQ.values('total_donated'), output_field=IntegerField()),
        last_donation_date=Cast(Max('payments__updated', filter=donation_filter), output_field=DateTimeField()),
        message_count=Subquery(message_count_SQ.values('message_count'), output_field=IntegerField()),
        last_message_date=Cast(Max('events__updated', filter=Q(events__event_id=6)), output_field=DateTimeField())
    )