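I have a set of query records with the columns (idx, time, category, weight, distance):
- idx is an INTEGER value describing some kind of relation
- time is a TIMESTAMP WITHOUT TIME ZONE; it can take (almost) arbitrary values, but each value occurs multiple times (for each idx and category)
- category is a VARCHAR and a categorical variable; its set of values is limited and they recur frequently
- weight is DOUBLE PRECISION
- distance is some precomputed value

Rows may look like this: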
(1, '2017-01-01 00:00', 'class_a', 1, 234.5)
(1, '2017-01-01 00:00', 'class_a', 1, 987.1)
(1, '2017-01-01 00:00', 'class_a', 1, 1.23)
(1, '2017-01-01 00:00', 'class_b', 1, 48.5)
(2, '2017-01-01 00:00', 'class_a', 1, 8763.5)
(1, '2017-01-01 00:13', 'class_a', 1, 598.02)
(1, '2017-01-01 00:13', 'class_b', 1, 76.9)
...
(2, '2017-01-27 21:07', 'class_b', 1, 184.0)
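I'm looking for a way to compute a custom aggregate over this data, but I could find hardly any pointers or examples on how to actually get it done (hopefully without having to write a C extension for Postgres).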
SELECT
    idx, time, category,
    weighted_density(
        value, distance, 10000.0 -- arbitrary 10k is explained below
    ) AS wd
FROM (my rows as shown above)
GROUP BY
    idx, time, category
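I feel that setting up a custom aggregate (named WEIGHTED_DENSITY here) should be the right way to implement the query outlined above. My goal is to end up with a result set in which the compound (idx, time, category) is unique and whose wd is computed from all weight and distance values of the associated rows.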
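At first, I fetched all rows from the database and computed the aggregate offline with another program and language (Python). But this is very resource-intensive, and the computation should run on the database server rather than on a local machine (also to ensure integrity).
Then I tried to set up a Postgres function that calculates the resulting value for a single row: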
CREATE OR REPLACE FUNCTION _gaussian_density(
    IN DOUBLE PRECISION, -- the weight
    IN DOUBLE PRECISION, -- the distance
    IN DOUBLE PRECISION -- the maximum distance
) RETURNS DOUBLE PRECISION AS
$BODY$
BEGIN
    -- calculate weighted density, using max distance;
    -- this calculation itself doesn't really matter; it's some sort
    -- of density using a cropped gaussian kernel, for those who ask.
    RETURN
        CASE
            WHEN ABS($2) > ABS($3) THEN 0.0
            WHEN ABS($2) <= 0.0 THEN 1.0
            ELSE
                $1 * (
                    1.0 / |/ (2.0 * PI())
                ) * POWER(EXP(-1 * (3.0 * ABS($2) / ABS($3))), 2)
                / 0.4
        END;
END
$BODY$
LANGUAGE plpgsql VOLATILE
COST 10;
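Read literally, the CASE expression above appears to compute the following piecewise function. The symbols are labels introduced here for illustration only: $w$ stands for the weight (`$1`), $d$ for the distance (`$2`) and $d_{\max}$ for the maximum distance (`$3`). `|/` is the PostgreSQL square-root operator, and squaring the exponential doubles its exponent.

```latex
\mathrm{density}(w, d, d_{\max}) =
\begin{cases}
0 & \text{if } |d| > |d_{\max}| \\[4pt]
1 & \text{if } |d| = 0 \\[4pt]
\dfrac{w}{0.4\,\sqrt{2\pi}} \; e^{-6\,|d| / |d_{\max}|} & \text{otherwise}
\end{cases}
```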
In addition, to make this function usable as an aggregate, I tried
CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
    sfunc = _gaussian_density,
    stype = DOUBLE PRECISION,
    initcond = 0.0
);
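But that is where I'm stuck: I just can't get it right, and it seems I need an example or a small hint to push me in the right direction on how to properly create and use a custom aggregate.
Cheers, and thanks in advance!
Thanks to @klin for pointing out that I had missed carrying the aggregate state. Now this finally works: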
CREATE FUNCTION _gaussian_density(
    weight FLOAT8,
    distance FLOAT8,
    maxdist FLOAT8
)
RETURNS FLOAT8
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql
AS $$
DECLARE
    abs_weight FLOAT8;
    abs_distance FLOAT8;
    abs_maxdist FLOAT8;
    dist_weight FLOAT8;
BEGIN
    -- calculate weighted density, using max distance;
    -- this calculation itself doesn't really matter; it's some sort
    -- of density using a cropped gaussian kernel, for the curious
    abs_weight := ABS(COALESCE(weight, 1.0));
    abs_distance := ABS(COALESCE(distance, 0.0));
    abs_maxdist := ABS(COALESCE(maxdist, 0.0));
    IF abs_distance > abs_maxdist THEN RETURN 0.0; END IF;
    IF abs_distance <= 0.0 THEN RETURN 1.0 * abs_weight; END IF;
    RETURN abs_weight * (
        1.0 / |/ (2.0 * PI())
    ) * POWER(EXP(-1 * (3.0 * abs_distance / abs_maxdist)), 2)
    / 0.4;
END;
$$;
CREATE FUNCTION _gaussian_statetransition(
    agg_state FLOAT8, -- carry the state!
    weight FLOAT8,
    distance FLOAT8,
    maxdist FLOAT8)
RETURNS FLOAT8
IMMUTABLE
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN
        agg_state + _gaussian_density(weight, distance, maxdist);
END;
$$;
CREATE AGGREGATE weighted_density(FLOAT8, FLOAT8, FLOAT8)
(
    sfunc = _gaussian_statetransition,
    stype = FLOAT8,
    initcond = 0
);
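I still wanted to be able to use the density calculation function outside of the aggregate, so I decided to add a separate function for the state transition, which in turn calls _gaussian_density.
The aggregate then defines the state type and its initial state, and we're good to go. To handle some edge cases properly, I tweaked _gaussian_density a little (also so that it deals with NULL values).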
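To make the state-carrying idea concrete, here is a minimal Python sketch of how the aggregate above is evaluated for one group: start from `initcond`, then feed each row through the state transition function. It is only a model of the behaviour, not PostgreSQL code. The Python function names mirror the SQL ones, `None` stands in for SQL NULL, and the sample rows are taken from the example data in the question.

```python
import math


def gaussian_density(weight, distance, maxdist):
    """Python port of _gaussian_density; None plays the role of SQL NULL."""
    abs_weight = abs(1.0 if weight is None else weight)
    abs_distance = abs(0.0 if distance is None else distance)
    abs_maxdist = abs(0.0 if maxdist is None else maxdist)
    if abs_distance > abs_maxdist:
        return 0.0
    if abs_distance <= 0.0:
        return 1.0 * abs_weight
    return (abs_weight * (1.0 / math.sqrt(2.0 * math.pi))
            * math.exp(-1 * (3.0 * abs_distance / abs_maxdist)) ** 2
            / 0.4)


def gaussian_statetransition(agg_state, weight, distance, maxdist):
    """Port of _gaussian_statetransition: old state in, new state out."""
    return agg_state + gaussian_density(weight, distance, maxdist)


def weighted_density(rows, maxdist, initcond=0.0):
    """Model of the aggregate: fold the rows of one group into one value."""
    state = initcond  # corresponds to initcond = 0
    for weight, distance in rows:
        state = gaussian_statetransition(state, weight, distance, maxdist)
    return state


# (weight, distance) pairs of the group (1, '2017-01-01 00:00', 'class_a')
group = [(1, 234.5), (1, 987.1), (1, 1.23)]
print(weighted_density(group, 10000.0))
```

Because the transition function only adds to the state, the result is the sum of the per-row densities, and an empty group simply yields the initial condition.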
Thanks a lot!
Answer 0 (score: 1)
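The function _gaussian_density() should depend on the value computed in the previous step. If in your case this is the first argument, weight, then the initial condition should not be 0, because all subsequent calculations would yield zero. I assumed that the initial value of weight is 1.0: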
DROP AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION);
CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
    sfunc = _gaussian_density,
    stype = DOUBLE PRECISION,
    initcond = 1.0 -- !!
);
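Note that the aggregate does not use the table's weight column: weight is the internal state value, so you only declare its initial condition and get it back as the final result.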
SELECT
    idx, time, category,
    weighted_density(distance, 10000) AS wd -- !!
FROM my_table
GROUP BY idx, time, category
ORDER BY idx, time, category;
 idx |        time         | category |         wd
-----+---------------------+----------+---------------------
   1 | 2017-01-01 00:00:00 | class_a  |   0.476331421206002
   1 | 2017-01-01 00:00:00 | class_b  |   0.968750868953701
   1 | 2017-01-01 00:13:00 | class_a  |    0.69665860026144
   1 | 2017-01-01 00:13:00 | class_b  |   0.952383202706387
   2 | 2017-01-01 00:00:00 | class_a  | 0.00519142111518706
   2 | 2017-01-27 21:07:00 | class_b  |   0.893107967346503
(6 rows)
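The effect of the initial condition can be checked with a small Python model of this aggregate. This is a sketch, not PostgreSQL code. It assumes the three-argument _gaussian_density from the question is used directly as the state function, so its first argument receives the running state; the Python names are illustrative.

```python
import math


def gaussian_density(state, distance, maxdist):
    """Port of the question's first _gaussian_density; $1 is the running state."""
    if abs(distance) > abs(maxdist):
        return 0.0
    if abs(distance) <= 0.0:
        return 1.0
    return (state * (1.0 / math.sqrt(2.0 * math.pi))
            * math.exp(-1 * (3.0 * abs(distance) / abs(maxdist))) ** 2
            / 0.4)


def weighted_density(distances, maxdist, initcond):
    """Model of the aggregate: the state is passed in as the first argument."""
    state = initcond
    for distance in distances:
        state = gaussian_density(state, distance, maxdist)
    return state


# distances of the group (1, '2017-01-01 00:00', 'class_a')
class_a = [234.5, 987.1, 1.23]
print(weighted_density(class_a, 10000, initcond=0.0))  # 0.0: the state never leaves zero
print(weighted_density(class_a, 10000, initcond=1.0))  # compare with the first row of the table
```

With the state used as a factor, the aggregate multiplies the per-row kernel values. Summing them instead, as the update to the question does, needs a separate state transition function.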
I'm not sure I have read your intent correctly, but my remarks should put you on the right track.