Question

I have a table on Amazon Redshift with about 600 million rows. I have a Python process that issues the following query over a sqlalchemy_redshift connection:

begin;
UPDATE dogs
SET computed_dog_type = f_get_dog_type(name, breed, age, color)
WHERE week = :week;
commit;

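The query works fine; however, running it over all 600 million rows at once is too slow. The WHERE clause effectively limits each run to one week's rows, where a week ranges from about 2K to 2-3 million rows.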

I wrapped the query code in Python threads, and a run looks like this:

16:38 $ python dog_classifier.py update_range 2009-10-05 2009-10-26
11-02 16:39 PTC          INFO     DOG CLASSIFIER STARTED
11-02 16:39 PTC          INFO     START update of dogs.computed_dog_type for week: 2009-10-05
11-02 16:39 PTC          INFO     START update of dogs.computed_dog_type for week: 2009-10-12
11-02 16:39 PTC          INFO     START update of dogs.computed_dog_type for week: 2009-10-19
11-02 16:39 PTC          INFO     START update of dogs.computed_dog_type for week: 2009-10-26
11-02 16:45 PTC          INFO     END update of 338378 records in dogs.computed_dog_type for week: 2009-10-12 in 6 minutes
11-02 16:52 PTC          INFO     END update of 355796 records in dogs.computed_dog_type for week: 2009-10-05 in 13 minutes
11-02 16:59 PTC          INFO     END update of 337909 records in dogs.computed_dog_type for week: 2009-10-19 in 20 minutes
11-02 17:07 PTC          INFO     END update of 281617 records in dogs.computed_dog_type for week: 2009-10-26 in 28 minutes
11-02 17:07 PTC          INFO     DOG CLASSIFIER STOPPED AFTER UPDATING 1313700 RECORDS

I run one month at a time; typically, 4-5 weeks of data comes to around a million rows.

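The queries appear to be serialized on Redshift. When I check the CloudWatch dashboards while these queries are running, the peaks and valleys correlate very clearly with my update queries: there is essentially one peak for each week a query is run for.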

I think the default query queue may be the culprit, but inspecting its behavior at runtime seems challenging.

How can I debug this? What would cause the queries to be serialized like this?

Answer 1

Your user-defined function (UDF) may be slowing things down. The UDF Constraints documentation says:

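The number of UDFs that can run concurrently per cluster is limited to one-fourth of the total concurrency level for the cluster. For example, if the cluster is configured with a concurrency of 15, a maximum of three UDFs can run concurrently. After the limit is reached, UDFs are queued for execution within workload management queues.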
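To make the quoted limit concrete, here is a minimal sketch. The helper name `max_concurrent_udfs` is hypothetical, and rounding down is an assumption based on the documentation's own example (15 gives 3). If the cluster still uses a single default queue with a concurrency of 5 (also an assumption, check your WLM configuration), the same arithmetic gives one UDF at a time, which would match four threads finishing one after another. The query string reads the `STV_WLM_QUERY_STATE` system table, which is one place to watch what is queued versus running while the updates are in flight.

```python
def max_concurrent_udfs(cluster_concurrency: int) -> int:
    """UDF limit per the quoted docs: one-fourth of total concurrency.

    Rounding down is an assumption inferred from the docs' example (15 -> 3).
    """
    return cluster_concurrency // 4


# Run this from a separate session while the updates are executing.
# queue_time and exec_time are reported in microseconds.
WLM_STATE_SQL = """
SELECT query, service_class, slot_count, state, queue_time, exec_time
FROM stv_wlm_query_state
ORDER BY wlm_start_time;
"""

if __name__ == "__main__":
    print(max_concurrent_udfs(15))  # the documentation's example
    print(max_concurrent_udfs(5))   # assumed default-queue concurrency
```

If most of your update queries show up in a queued state while one is running, that points at the WLM/UDF limit rather than at your Python threads.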

In addition, if your UDF is declared IMMUTABLE (its volatility category), Redshift can cache the UDF's return values, which can help speed things up.
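As a rough sketch of what that declaration looks like, the DDL below recreates the function with IMMUTABLE volatility. The argument types, the return type, the assumption that this is a Python (plpythonu) UDF, and the placeholder body are all guesses, since the question does not show the function's definition. IMMUTABLE is only appropriate if the function always returns the same result for the same arguments.

```python
# Hypothetical DDL: types and body are assumptions, not the asker's real UDF.
CREATE_UDF_SQL = """
CREATE OR REPLACE FUNCTION f_get_dog_type(
    name VARCHAR, breed VARCHAR, age INTEGER, color VARCHAR
)
RETURNS VARCHAR
IMMUTABLE
AS $$
    # Placeholder body: the real classification logic goes here.
    return None
$$ LANGUAGE plpythonu;
"""

if __name__ == "__main__":
    # In practice this would be executed once over the existing connection.
    print(CREATE_UDF_SQL)
```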

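You could avoid running the UDF during the update altogether by creating a lookup table containing all of the values and joining to it, which lets Redshift optimize the query. Such a table should be created with DISTSTYLE ALL so that a full copy is distributed to every node.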
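The lookup-table idea might look roughly like the following. The table name `dog_type_lookup`, the column types, and the way the table is populated are assumptions for illustration; only `dogs`, its columns, and `f_get_dog_type` come from the question. Populating the table still calls the UDF, but only once per distinct combination of inputs rather than once per row on every update.

```python
# Hypothetical lookup table; column types are guesses.
CREATE_LOOKUP_SQL = """
CREATE TABLE dog_type_lookup (
    name VARCHAR(256),
    breed VARCHAR(256),
    age INTEGER,
    color VARCHAR(256),
    computed_dog_type VARCHAR(256)
)
DISTSTYLE ALL;
"""

# One UDF call per distinct input combination.
POPULATE_LOOKUP_SQL = """
INSERT INTO dog_type_lookup
SELECT name, breed, age, color, f_get_dog_type(name, breed, age, color)
FROM (SELECT DISTINCT name, breed, age, color FROM dogs) AS d;
"""

# The weekly update becomes a plain join, with no UDF involved.
# Note: equality joins skip rows where any of these columns is NULL.
UPDATE_WEEK_SQL = """
UPDATE dogs
SET computed_dog_type = l.computed_dog_type
FROM dog_type_lookup AS l
WHERE dogs.week = :week
  AND dogs.name = l.name
  AND dogs.breed = l.breed
  AND dogs.age = l.age
  AND dogs.color = l.color;
"""

if __name__ == "__main__":
    for statement in (CREATE_LOOKUP_SQL, POPULATE_LOOKUP_SQL, UPDATE_WEEK_SQL):
        print(statement.strip(), end="\n\n")
```

The `:week` placeholder is the same bind parameter your existing query uses, so the update statement should slot into the current per-week loop. Whether this pays off depends on how many distinct (name, breed, age, color) combinations you actually have.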
