Custom aggregate in Postgres using multiple columns

Time: 2017-10-18 14:22:24

Tags: sql postgresql

Disclaimer: solution below

I have a set of query records with the columns (idx, time, category, weight, distance):

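  • idx is an INTEGER describing some relation
  • time is a TIMESTAMP WITHOUT TIME ZONE that can take (almost) arbitrary values, but each value occurs multiple times (for each idx and category)
  • category is a VARCHAR and a categorical variable; it has a limited set of values, each of which occurs frequently
  • weight is a DOUBLE PRECISION
  • distance is some precomputed value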

Rows may look like this:

(1, '2017-01-01 00:00', 'class_a', 1, 234.5)
(1, '2017-01-01 00:00', 'class_a', 1, 987.1)
(1, '2017-01-01 00:00', 'class_a', 1, 1.23)
(1, '2017-01-01 00:00', 'class_b', 1, 48.5)
(2, '2017-01-01 00:00', 'class_a', 1, 8763.5)
(1, '2017-01-01 00:13', 'class_a', 1, 598.02)
(1, '2017-01-01 00:13', 'class_b', 1, 76.9)
...
(2, '2017-01-27 21:07', 'class_b', 1, 184.0)

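What is the problem?

I am looking for a way to compute a custom aggregate over this data, but I could hardly find any pointers or examples of how this is actually done (hopefully without having to write a C extension for Postgres).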

SELECT
  idx, time, category,
  weighted_density(
    weight, distance, 10000.0 -- arbitrary 10k is explained below
  ) AS wd
FROM my_table -- placeholder for the rows shown above
GROUP BY
  idx, time, category

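I feel that setting up a custom aggregate (named WEIGHTED_DENSITY here) should be the right way to implement the query outlined above. My goal is to end up with a result set in which the compound key (idx, time, category) is unique and wd is computed from the weight and distance values of all rows belonging to that key.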

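What have I tried so far?

At first, I fetched the complete rows from the database and computed the aggregate offline, using another program and language (Python). But this is very resource-intensive and should run on the database server rather than on a local machine (also to ensure integrity).

Then I tried to set up a Postgres function that computes the resulting value for a single row: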

CREATE OR REPLACE FUNCTION _gaussian_density(
    IN DOUBLE PRECISION, -- the weight
    IN DOUBLE PRECISION, -- the distance
    IN DOUBLE PRECISION  -- the maximum distance
  ) RETURNS DOUBLE PRECISION AS
$BODY$
BEGIN
  -- calculate weighted density, using max distance;
  -- this calculation itself doesn't really matter; it's some sort
  -- of density using a cropped gaussian kernel, for those who ask.
  RETURN
    CASE
      WHEN ABS($2) > ABS($3) THEN 0.0
      WHEN ABS($2) <= 0.0 THEN 1.0
      ELSE
        $1 * (
          1.0 / |/ (2.0 * PI())
        ) * POWER(EXP(-1 * (3.0 * ABS($2) / ABS($3))), 2)
        / 0.4
    END;
END
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 10;

Furthermore, to make this function usable as an aggregate, I tried

CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
    sfunc = _gaussian_density,
    stype = DOUBLE PRECISION,
    initcond = 0.0
);

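But this is where I got stuck. I just cannot get it right, and it seems I need an example or a small hint pointing me in the right direction on how to create and use a custom aggregate properly.

Cheers to you all, and thanks in advance!

Thanks to @klin for pointing out that I had missed carrying the aggregate state. Now this finally works: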

CREATE FUNCTION _gaussian_density(
    weight FLOAT8,
    distance FLOAT8,
    maxdist FLOAT8
  )
RETURNS FLOAT8
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql
AS $$
  DECLARE
    abs_weight FLOAT8;
    abs_distance FLOAT8;
    abs_maxdist FLOAT8;
    dist_weight FLOAT8;
  BEGIN
    -- calculate weighted density, using max distance;
    -- this calculation itself doesn't really matter; it's some sort
    -- of density using a cropped gaussian kernel, for the curious
    abs_weight := ABS(COALESCE(weight, 1.0));
    abs_distance := ABS(COALESCE(distance, 0.0));
    abs_maxdist := ABS(COALESCE(maxdist, 0.0));
    IF abs_distance > abs_maxdist THEN RETURN 0.0; END IF;
    IF abs_distance <= 0.0 THEN RETURN 1.0 * abs_weight; END IF;
    RETURN abs_weight * (
            1.0 / |/ (2.0 * PI())
          ) * POWER(EXP(-1 * (3.0 * abs_distance / abs_maxdist)), 2)
          / 0.4;
  END;
$$;

CREATE FUNCTION _gaussian_statetransition(
    agg_state FLOAT8, -- carry the state!
    weight FLOAT8,
    distance FLOAT8,
    maxdist FLOAT8)
RETURNS FLOAT8
IMMUTABLE
LANGUAGE plpgsql
AS $$
  BEGIN
    RETURN
      agg_state + _gaussian_density(weight, distance, maxdist);
  END;
$$;

CREATE AGGREGATE weighted_density(FLOAT8, FLOAT8, FLOAT8)
(
    sfunc = _gaussian_statetransition,
    stype = FLOAT8,
    initcond = 0
);

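I still wanted to be able to use the density function outside of the aggregate, so I decided to add a separate function for the state transition, which in turn calls _gaussian_density.

The aggregate then defines the state type and its initial state, and we are good to go. To handle some edge cases correctly, I also tweaked _gaussian_density slightly (among other things, to handle NULL values).

Thanks a lot!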
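For reference, here is a minimal sketch of how the two pieces above could be called. It assumes the functions and the aggregate have been created exactly as shown, and it uses an inline VALUES list (the alias sample_rows is made up for illustration) in place of the real query records:

```sql
-- The density function stays usable on its own, outside the aggregate:
SELECT _gaussian_density(1.0, 0.0, 10000.0);      -- zero distance: returns the weight, 1
SELECT _gaussian_density(1.0, 20000.0, 10000.0);  -- beyond maxdist: returns 0

-- The aggregate sums the per-row densities within each group.
-- sample_rows is a hypothetical stand-in for the real query records.
SELECT
  idx, time, category,
  weighted_density(weight, distance, 10000.0) AS wd
FROM (
  VALUES
    (1, TIMESTAMP '2017-01-01 00:00', 'class_a', 1.0::FLOAT8, 234.5::FLOAT8),
    (1, TIMESTAMP '2017-01-01 00:00', 'class_a', 1.0, 987.1),
    (1, TIMESTAMP '2017-01-01 00:00', 'class_a', 1.0, 1.23),
    (1, TIMESTAMP '2017-01-01 00:00', 'class_b', 1.0, 48.5),
    (2, TIMESTAMP '2017-01-01 00:00', 'class_a', 1.0, 8763.5),
    (1, TIMESTAMP '2017-01-01 00:13', 'class_a', 1.0, 598.02),
    (1, TIMESTAMP '2017-01-01 00:13', 'class_b', 1.0, 76.9)
) AS sample_rows(idx, time, category, weight, distance)
GROUP BY idx, time, category
ORDER BY idx, time, category;
```

Because the state starts at 0 and each row adds its own density, every (idx, time, category) group ends up with the sum of its rows' densities.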

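1 Answer:

Answer 0 (score: 1):

The function _gaussian_density() should depend on the value computed in the previous step. If in your case this is the first argument, weight, then the initial condition should not be 0, because every subsequent calculation would then yield zero. I have assumed that the initial value of weight is 1.0: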

DROP AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION);
CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
    sfunc = _gaussian_density,
    stype = DOUBLE PRECISION,
    initcond = 1.0 -- !!
);

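Note that the aggregate does not use the table's weight column: here weight is the internal state value, so only its initial condition is declared, and it is returned as the final result.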

SELECT
    idx, time, category,
    weighted_density(distance, 10000) AS wd -- !!
FROM my_table
GROUP BY idx, time, category  
ORDER BY idx, time, category;

 idx |        time         | category |         wd          
-----+---------------------+----------+---------------------
   1 | 2017-01-01 00:00:00 | class_a  |   0.476331421206002
   1 | 2017-01-01 00:00:00 | class_b  |   0.968750868953701
   1 | 2017-01-01 00:13:00 | class_a  |    0.69665860026144
   1 | 2017-01-01 00:13:00 | class_b  |   0.952383202706387
   2 | 2017-01-01 00:00:00 | class_a  | 0.00519142111518706
   2 | 2017-01-27 21:07:00 | class_b  |   0.893107967346503
(6 rows)    
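
To make the role of the state value concrete, here is a minimal, self-contained sketch. It is not part of the setup above, and the names _demo_sum_step and demo_sum are invented for illustration. It shows that the first parameter of the state transition function is always the running state, while the aggregate itself is declared with the per-row arguments only:

```sql
-- Hypothetical names for illustration only: _demo_sum_step and demo_sum.
-- The first parameter (acc) is the running state; Postgres passes in the
-- value returned for the previous row, or initcond for the first row.
CREATE FUNCTION _demo_sum_step(acc FLOAT8, x FLOAT8)
RETURNS FLOAT8
LANGUAGE sql
IMMUTABLE
AS $$
  SELECT acc + COALESCE(x, 0.0);
$$;

-- The aggregate is declared with the per-row arguments only (one here);
-- the state parameter is implied by sfunc and stype.
CREATE AGGREGATE demo_sum(FLOAT8)
(
    sfunc = _demo_sum_step,
    stype = FLOAT8,
    initcond = '0'
);

SELECT demo_sum(x::FLOAT8)
FROM (VALUES (1.0::FLOAT8), (2.0), (3.5)) AS t(x);  -- 6.5
```

The same rule explains why a two-argument aggregate can use the three-argument _gaussian_density as its sfunc: the function's first argument is treated as the state.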

I am not sure I have read your intent correctly, but my remarks should put you on the right track.