Snowflake ST_POLYGON(TO_GEOGRAPHY(...)) is inefficient

Time: 2021-03-03 22:36:18

Tags: gis snowflake-cloud-data-platform

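I have several queries that use geospatial conditions, and they run surprisingly slowly. At first I thought the geospatial computation itself was to blame, but even after stripping everything down to just ST_POLYGON(TO_GEOGRAPHY(...)), the query is still slow. That would make sense if every row had its own polygon, but the condition uses a static polygon in the query: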

SELECT 
    ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'))
FROM TABLE(GENERATOR(ROWCOUNT=>1000000))

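Snowflake should be able to figure out that it only needs to compute this polygon once for the whole query. Yet the more rows I add, the slower it gets. On an x-small warehouse, this query takes over a minute, whereas this query: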

SELECT 
    'LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
FROM TABLE(GENERATOR(ROWCOUNT=>3000000))

(with 2 million more rows added to match the number of bytes)

finishes in 2 seconds.

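I tried "precomputing" the polygon myself with a WITH clause, but Snowflake figured out that the WITH was redundant and dropped it. I also tried setting a session variable, but you can't store a complex value like this in a variable.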

I think this is a bug.

1 answer:

Answer 0 (score: 3)

Geospatial functions are currently in preview, and the team is working hard on all kinds of optimizations.

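For this case, I want to note that putting the polygon into a single-row table helps, but I would still expect much better performance as the team takes this feature out of beta.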

Let me create a table with a single row, the polygon:

create or replace temp table poly1
as
select ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
       )) polygon
;

To see whether this helps, I tried a cross join against 1 million rows:

select *
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));

This takes 14 seconds, and in the query profiler you can see that most of the time is spent on an internal TO_OBJECT(GET_PATH(POLY1.POLYGON, '_shape')).


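Note that the previous operation is mostly concerned with producing the ASCII representation of the polygon. Running operations over this polygon is much faster: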

select st_area(polygon)
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));

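This query should take longer (finding the area of a polygon sounds more complex than just selecting it), but it turns out it took only 7 seconds (about half).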
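Since the question is really about geospatial filter conditions, here is a minimal sketch of how the single-row table might be used in one. The random_points CTE, its pt column, and the bounding-box numbers are made up for illustration and are not from the original post. The sketch assumes the poly1 table created above exists, and no timing was measured for it here.

```sql
-- Sketch only: random_points, pt and points_inside are hypothetical names.
-- poly1 is the single-row temp table created earlier in this answer.
with random_points as (
    select ST_MAKEPOINT(
               uniform((-96)::float, (-77)::float, random()),  -- longitude
               uniform(28::float, 39::float, random())         -- latitude
           ) as pt
    from TABLE(GENERATOR(ROWCOUNT=>1000000))
)
select count(*) as points_inside
from random_points, poly1
-- The polygon is read from the table instead of being rebuilt inline per row.
where ST_CONTAINS(poly1.polygon, random_points.pt);
```

The idea is the same as in the st_area query: the filter only operates on the stored GEOGRAPHY value and never needs to render it as text.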


Thanks for the report. The team will keep optimizing cases like this one.


For anyone curious about the particular polygon in the question: it is kind of a heart.