Question

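I have a set of Django ORM models representing a directed graph, and I'm trying to retrieve all the vertices adjacent to a given vertex, ignoring edge direction: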

from django.db import models

class Vertex(models.Model):
    pass

class Edge(models.Model):
    orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True)
    dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True)
    # ... other data about this edge ...

The query Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() returns the correct result, but in my case it takes far too long to execute.

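Typically my application will have around 50-100 vertices and around a million edges at any given time. Even when I reduce that to only 20 vertices and 100,000 edges, the query takes about a minute and a half to execute: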

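import itertools
import random
import timeit

import tqdm
from django.db.models import Q

# Create 20 vertices and 100,000 randomly connected edges
for i in range(20):
    Vertex().save()

vxs = list(Vertex.objects.all())
for i in tqdm.tqdm(range(100000)):
    Edge(orig=random.choice(vxs), dest=random.choice(vxs)).save()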

v = vxs[0]
def f1():
    return list( Vertex.objects.filter(
        Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() )

t1 = timeit.Timer(f1)

print( t1.timeit(number=1) ) # 84.21138522100227

On the other hand, if I split the query into two parts, I get exactly the same result in just a few milliseconds:

def f2():
    q1 = Vertex.objects.filter(edge_orig__dest=v).distinct()
    q2 = Vertex.objects.filter(edge_dest__orig=v).distinct()
    return list( {x for x in itertools.chain(q1, q2)} )

t2 = timeit.Timer(f2)
print( t2.timeit(number=100)/100 ) # 0.0109818680600074

The second version has problems, though:

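It isn't atomic. The edge list is almost guaranteed to change between the two queries in my application, which means the results won't represent a single point in time.
I can't perform additional processing and aggregation on the results without looping over them manually. (For example, if I wanted Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field')).)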

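Why does the second version run so much faster than the first one? How can I do this, and is there a way to get the first one to run fast enough for practical use?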

I'm currently testing with Python 3.5.2, PostgreSQL 9.5.6, and Django 1.11, but if the problem lies with one of those, I'm not tied to them.

Here is the SQL generated by each query, along with the Postgres EXPLAIN output:

The first one:

Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v))

SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061 
       OR T4."orig_id" = 1061)

HashAggregate  (cost=8275151.47..8275151.67 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Left Join  (cost=3183.45..8154147.45 rows=48401608 width=4)
        Hash Cond: (app_vertex.id = app_edge.orig_id)
        Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
        ->  Hash Right Join  (cost=1.45..2917.45 rows=100000 width=8)
              Hash Cond: (t4.dest_id = app_vertex.id)
              ->  Seq Scan on app_edge t4  (cost=0.00..1541.00 rows=100000 width=8)
              ->  Hash  (cost=1.20..1.20 rows=20 width=4)
                    ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)
        ->  Hash  (cost=1541.00..1541.00 rows=100000 width=8)
              ->  Seq Scan on app_edge  (cost=0.00..1541.00 rows=100000 width=8)

The second one:

Vertex.objects.filter(edge_orig__dest=v).distinct()

SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
WHERE "app_edge"."dest_id" = 1061

HashAggregate  (cost=1531.42..1531.62 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Join  (cost=848.11..1519.04 rows=4950 width=4)
        Hash Cond: (app_edge.orig_id = app_vertex.id)
        ->  Bitmap Heap Scan on app_edge  (cost=846.65..1449.53 rows=4950 width=4)
              Recheck Cond: (dest_id = 1061)
              ->  Bitmap Index Scan on app_edge_dest_id_4254b90f  (cost=0.00..845.42 rows=4950 width=0)
                    Index Cond: (dest_id = 1061)
        ->  Hash  (cost=1.20..1.20 rows=20 width=4)
              ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)

@khampson's version also takes a minute and a half to run, so it's a no-go as well.

Vertex.objects.raw( ... )

SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
       OR T4."orig_id" = 1061);

HashAggregate  (cost=8275347.47..8275347.67 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Join  (cost=3183.45..8154343.45 rows=48401608 width=4)
        Hash Cond: (app_vertex.id = app_edge.orig_id)
        Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
        ->  Hash Join  (cost=1.45..2917.45 rows=100000 width=12)
              Hash Cond: (t4.dest_id = app_vertex.id)
              ->  Seq Scan on app_edge t4  (cost=0.00..1541.00 rows=100000 width=8)
              ->  Hash  (cost=1.20..1.20 rows=20 width=4)
                    ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)
        ->  Hash  (cost=1541.00..1541.00 rows=100000 width=8)
              ->  Seq Scan on app_edge  (cost=0.00..1541.00 rows=100000 width=8)

Answer 1

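The query plans for those two queries are radically different. The first (slower) one isn't hitting any indexes, and it is doing two left joins, both of which result in far more rows being processed and returned. From what I can interpret of the intent of the Django ORM syntax, it doesn't sound like you truly want left joins here.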
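To make the row blow-up concrete, here is a minimal sketch that uses Python's built-in sqlite3 module rather than PostgreSQL and Django. The table names `vertex` and `edge` and the toy graph size are assumptions made for illustration only. It shows that joining the edge table twice on the same vertex yields roughly out-degree times in-degree rows per vertex before any filter or DISTINCT is applied.

```python
import random
import sqlite3

random.seed(0)  # deterministic toy data

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE edge (id INTEGER PRIMARY KEY, orig_id INTEGER, dest_id INTEGER)"
)

vertex_ids = list(range(1, 6))  # 5 vertices (hypothetical toy size)
conn.executemany("INSERT INTO vertex (id) VALUES (?)", [(i,) for i in vertex_ids])
edges = [(random.choice(vertex_ids), random.choice(vertex_ids)) for _ in range(200)]
conn.executemany("INSERT INTO edge (orig_id, dest_id) VALUES (?, ?)", edges)

# Rows produced by the double join, before any WHERE filter or DISTINCT.
double_join_rows = conn.execute(
    """
    SELECT COUNT(*)
    FROM vertex
    LEFT JOIN edge AS e1 ON vertex.id = e1.orig_id
    LEFT JOIN edge AS e2 ON vertex.id = e2.dest_id
    """
).fetchone()[0]

# The same number, predicted from per-vertex degrees.
predicted = 0
for vid in vertex_ids:
    out_deg = sum(1 for o, d in edges if o == vid)
    in_deg = sum(1 for o, d in edges if d == vid)
    # A LEFT JOIN keeps one NULL-padded row when there is no match.
    predicted += max(out_deg, 1) * max(in_deg, 1)

print(double_join_rows, predicted, len(edges))
```

If the edges in the question's test data (20 vertices, 100,000 edges) are spread roughly evenly, each vertex has about 5,000 outgoing and 5,000 incoming edges, so the same product is on the order of 20 × 5,000 × 5,000 = 500 million intermediate rows.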

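In this case I would recommend considering dropping down to raw SQL from within the Django ORM and combining the two. For example, you could take the first query and transform it into something like this: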

SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
       OR T4."orig_id" = 1061);

Two questions there: how does that version perform, and does it give you the results you're looking for?

For more information on raw queries, see the section of the Django documentation on performing raw SQL queries.

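Response to the OP's comment:

The query plan for the query I suggested also shows that it isn't hitting any indexes.

Do you have indexes on both tables for the columns involved? I suspect not, especially since this specific query looks up a single value. If there were indexes, I would be very surprised if the query planner decided that a sequential scan was the better choice. (OTOH, if you were looking up more than, say, 10% of the rows in the table, the query planner might correctly make such a decision.)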
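One way to keep a single statement while letting each branch filter on its own indexed column is to express the OR as a UNION of the two fast queries from the question. The following is only a sketch under assumptions: it uses Python's built-in sqlite3 module instead of PostgreSQL, and the table names, index names, and four-vertex graph are made up for illustration. It checks that the OR-over-two-joins form and the UNION form return the same vertices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vertex (id INTEGER PRIMARY KEY);
    CREATE TABLE edge (id INTEGER PRIMARY KEY, orig_id INTEGER, dest_id INTEGER);
    CREATE INDEX edge_orig_idx ON edge (orig_id);
    CREATE INDEX edge_dest_idx ON edge (dest_id);
    INSERT INTO vertex (id) VALUES (1), (2), (3), (4);
    -- Edges: 1 -> 2, 3 -> 1, 2 -> 4, and a duplicate 1 -> 2
    INSERT INTO edge (orig_id, dest_id) VALUES (1, 2), (3, 1), (2, 4), (1, 2);
    """
)

v = 1  # the vertex whose neighbours we want

# Shape of the slow query: two joins combined with OR.
or_join = """
    SELECT DISTINCT vertex.id
    FROM vertex
    LEFT JOIN edge AS e1 ON vertex.id = e1.orig_id
    LEFT JOIN edge AS e2 ON vertex.id = e2.dest_id
    WHERE e1.dest_id = :v OR e2.orig_id = :v
"""

# Shape of the two fast queries, merged into one statement.
# UNION (without ALL) already removes duplicates.
union = """
    SELECT vertex.id FROM vertex JOIN edge ON vertex.id = edge.orig_id
    WHERE edge.dest_id = :v
    UNION
    SELECT vertex.id FROM vertex JOIN edge ON vertex.id = edge.dest_id
    WHERE edge.orig_id = :v
"""

or_result = sorted(row[0] for row in conn.execute(or_join, {"v": v}))
union_result = sorted(row[0] for row in conn.execute(union, {"v": v}))
print(or_result, union_result)
```

Django 1.11 added QuerySet.union(), which may let you build the same shape from the ORM by combining the two querysets from f2. Querysets produced by union() support only a limited set of further operations, so it is worth checking the documentation before relying on it for aggregation.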

Answer 2

Another query I would suggest is:

from django.db.models import Q

# Get edges which contain Vertex v, "only" optimizes fields returned
edges = Edge.objects.filter(Q(orig=v) | Q(dest=v)).only('orig_id', 'dest_id')
# Get set of vertex id's to discard duplicates
vertex_ids = {*edges.values_list('orig_id', flat=True), *edges.values_list('dest_id', flat=True)}
# Get list of vertices, excluding the original vertex
vertices = Vertex.objects.filter(pk__in=vertex_ids).exclude(pk=v.pk)

This should not require any joins, and it should not suffer from the race condition you mentioned.
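The same idea can be sketched so that the edge table is read in a single statement, fetching both columns at once instead of issuing one lookup per column. This is a rough illustration using Python's built-in sqlite3 module; the `neighbor_ids` helper, the table layout, and the sample edges are hypothetical and not part of the original code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE edge (id INTEGER PRIMARY KEY, orig_id INTEGER, dest_id INTEGER);
    -- Edges: 1 -> 2, 3 -> 1, 2 -> 4, and a self-loop 1 -> 1
    INSERT INTO edge (orig_id, dest_id) VALUES (1, 2), (3, 1), (2, 4), (1, 1);
    """
)


def neighbor_ids(conn, v):
    """Return the ids of vertices sharing an edge with v, in either direction."""
    rows = conn.execute(
        "SELECT orig_id, dest_id FROM edge WHERE orig_id = ? OR dest_id = ?",
        (v, v),
    )
    ids = set()
    for orig_id, dest_id in rows:
        # Every matching edge has v at one end; collect both ends.
        ids.add(orig_id)
        ids.add(dest_id)
    ids.discard(v)  # drop v itself, mirroring .exclude(pk=v.pk)
    return ids


print(neighbor_ids(conn, 1))
print(neighbor_ids(conn, 4))
```

In the ORM, the closest equivalent would probably be a single edges.values_list('orig_id', 'dest_id') call. Note that, like the snippet above, this drops v itself, so a self-loop on v is not reported, whereas the original join query would return v in that case.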
