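The CDC growth chart dataset provides a good example of what I am trying to accomplish: http://www.cdc.gov/growthcharts/html_charts/statage.htm
Suppose their tables have been converted to the following form:
A cdc table with columns: chart_label, sex, age, tau, val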
with tmp (chart_label, sex, age, tau, val) as (values
('bmi for age','F',2,0.03,14.14735),
('bmi for age','F',2,0.05,14.39787),
('bmi for age','F',2,0.1,14.80134),
('bmi for age','F',2,0.25,15.52808),
('bmi for age','F',2,0.5,16.4234),
('bmi for age','F',2,0.75,17.42746),
('bmi for age','F',2,0.85,18.01821),
('bmi for age','F',2,0.9,18.44139),
('bmi for age','F',2,0.95,19.10624),
('bmi for age','F',2,0.97,19.56411),
('bmi for age','F',2.041667,0.03,14.13226),
('bmi for age','F',2.041667,0.05,14.38019),
('bmi for age','F',2.041667,0.1,14.77965),
('bmi for age','F',2.041667,0.25,15.49976),
('bmi for age','F',2.041667,0.5,16.38804),
('bmi for age','F',2.041667,0.75,17.38582),
('bmi for age','F',2.041667,0.85,17.97371),
('bmi for age','F',2.041667,0.9,18.39526),
('bmi for age','F',2.041667,0.95,19.05824),
('bmi for age','F',2.041667,0.97,19.51534))
select * from tmp;
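I would like to write a PostgreSQL function that returns the estimated tau for a given chart, sex, age, and value, using linear interpolation to estimate tau when no exact value is available for the inputs.
For example (pseudocode):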
select interp('bmi for age', 'F', 2.02, 15);
should return a tau between .1 and .25 (about 0.141), because it would interpolate between these two rows:
('bmi for age','F',2,0.1,14.80134),
('bmi for age','F',2,0.25,15.52808),
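As a check on the 0.141 figure, the interpolation between those two rows can be written out by hand. This is only the arithmetic for this one example (x1 and x2 are the bracketing values, y1 and y2 their taus), not the function being asked for:

```sql
-- tau = y1 + (x - x1) / (x2 - x1) * (y2 - y1)
-- x = 15, (x1, y1) = (14.80134, 0.1), (x2, y2) = (15.52808, 0.25)
select 0.1 + (15 - 14.80134) / (15.52808 - 14.80134) * (0.25 - 0.1) as tau;
```

The result is approximately 0.141. Age 2 is used here because 2.02 is slightly closer to 2 than to 2.041667.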
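I do realize that linear interpolation is probably not the ideal way to find the appropriate percentile, but as I said, the CDC growth charts are a suitable approximation of my actual use case.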
Answer 0 (score: 0)
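I came up with a couple of solutions based on SO searches, linked questions, and the documentation. The unfortunate thing about each of them is that they are relatively slow, because the lookup is built once for every value.
Also, each could probably be improved with error handling, input validation, and better logic for the boundary conditions. For now, if the requested value is outside the range of the table, I just return the low/high extreme tau.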
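As one possible shape for the input validation mentioned above, a small helper could reject unexpected arguments before any lookup is done. The name cdc_check_inputs is made up for this sketch, and the list of chart labels is taken from the comment in the SQL solution below, so it may not match the real table:

```sql
-- hypothetical helper, not part of either solution below
create or replace function cdc_check_inputs(_valtype text,
                                            _insex character(1))
returns void as
$$
begin
    -- chart labels as listed in the comment of the SQL solution
    if _valtype not in ('bmi for age', 'wt for age', 'ht for age') then
        raise exception 'unknown chart label: %', _valtype;
    end if;
    if _insex not in ('M', 'F') then
        raise exception 'sex must be M or F, got: %', _insex;
    end if;
    -- note: NULL arguments pass both checks, since NOT IN yields NULL for them
end;
$$ language plpgsql immutable;
```

Calling select cdc_check_inputs('bmi for age', 'X'); would raise an exception, while valid arguments return nothing.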
SQL solution:
create or replace function cdcInterp(_valtype text,
                                     _insex character(1),
                                     _inage numeric,
                                     _inval numeric)
-- _valtype should be one of 'bmi for age', 'wt for age', or 'ht for age'
-- _insex should be either 'M' or 'F'
returns numeric as
$$
-- make a lookup table
with lkup as (
    select *
    from cdc_chart_value
    where chart_label = _valtype
      and sex = _insex
    order by abs(age - _inage) asc, age, tau
    -- order by ensures that I am using the closest age,
    -- with ties defaulting to the younger age
    -- 10 is a magic number: it is the number of taus for each age
    -- (0.03, 0.05, 0.10, 0.25, 0.50, 0.75, 0.85, 0.90, 0.95, 0.97)
    limit 10
),
-- find high and low values needed to do interpolation
vals as (
    select
        -- x1 is the lower value
        (SELECT lkup.val FROM lkup WHERE lkup.val <= _inval ORDER BY lkup.val DESC LIMIT 1) as x1,
        -- x2 is the upper value
        (SELECT lkup.val FROM lkup WHERE lkup.val >= _inval ORDER BY lkup.val ASC LIMIT 1) as x2,
        -- y1 is the lower tau
        (SELECT lkup.tau FROM lkup WHERE lkup.val <= _inval ORDER BY lkup.val DESC LIMIT 1) as y1,
        -- y2 is the upper tau
        (SELECT lkup.tau FROM lkup WHERE lkup.val >= _inval ORDER BY lkup.val ASC LIMIT 1) as y2
    -- no FROM clause: the scalar subqueries already give exactly one row,
    -- whereas "from lkup" repeated that row once per lookup row
)
-- interpolate, or not, as needed
SELECT
    CASE
        WHEN vals.x1 = vals.x2 THEN vals.y1 -- if equal, then return the exact tau
        WHEN vals.x1 IS NULL THEN vals.y2   -- if the lower value is null, then return the lowest tau (.03)
        WHEN vals.x2 IS NULL THEN vals.y1   -- if the upper value is null, then return the highest tau (.97)
        ELSE (vals.y1 + (_inval - vals.x1) / (vals.x2 - vals.x1) * (vals.y2 - vals.y1)) -- otherwise interpolate linearly
    END AS y
FROM vals
$$
language sql stable;
This is a bit slower than I had hoped (33 ms per query). Is there a way to do this faster?
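One variation that might be worth timing is sketched here. It selects the closest age explicitly, so the magic number 10 is not needed, and it pairs each row with its upper neighbour using the lead() window function. The function name cdc_interp_window is made up for this sketch. It assumes tau increases with val within an age, so the lowest value carries the lowest tau, and I have not measured whether it is actually faster:

```sql
create or replace function cdc_interp_window(_valtype text,
                                             _insex character(1),
                                             _inage numeric,
                                             _inval numeric)
returns numeric as
$$
with lkup as (
    -- all taus for the single closest age (ties go to the younger age)
    select c.tau, c.val
    from cdc_chart_value c
    where c.chart_label = _valtype
      and c.sex = _insex
      and c.age = (select c2.age
                   from cdc_chart_value c2
                   where c2.chart_label = _valtype
                     and c2.sex = _insex
                   order by abs(c2.age - _inage), c2.age
                   limit 1)
),
seg as (
    -- pair each row with the next higher value: one row per segment
    select val as x1, tau as y1,
           lead(val) over (order by val) as x2,
           lead(tau) over (order by val) as y2
    from lkup
)
select case
           -- at or below the table: return the lowest tau
           when _inval <= (select min(val) from lkup) then (select min(tau) from lkup)
           -- at or above the table: return the highest tau
           when _inval >= (select max(val) from lkup) then (select max(tau) from lkup)
           -- otherwise interpolate inside the segment that contains _inval
           else (select y1 + (_inval - x1) / (x2 - x1) * (y2 - y1)
                 from seg
                 where _inval >= x1 and _inval < x2
                 limit 1)
       end
$$ language sql stable;
```

With the sample rows from the question loaded into cdc_chart_value, select cdc_interp_window('bmi for age', 'F', 2.02, 15); should give roughly 0.141. An index on (chart_label, sex, age) may also help, depending on the size of the table.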
PLPGSQL solution: (takes about 50% longer than the SQL solution)
create or replace function interp2(_valtype text,
                                   _insex character(1),
                                   _inage numeric,
                                   _inval numeric)
returns numeric as
$$
DECLARE
    x1 numeric;
    x2 numeric;
    y1 numeric;
    y2 numeric;
    y numeric;
begin
    -- the overhead of creating/dropping a temporary table is bad
    drop table if exists _tmp_lkup;
    create temp table _tmp_lkup as
    (select *
     from cdc_chart_value
     where chart_label = _valtype
       and sex = _insex
     order by abs(age - _inage) asc, age, tau
     -- order by ensures that I am using the closest age,
     -- with ties defaulting to the younger age
     -- 10 is a magic number: it is the number of taus for each age
     -- (0.03, 0.05, 0.10, 0.25, 0.50, 0.75, 0.85, 0.90, 0.95, 0.97)
     limit 10
    );
    -- x1/y1 are the lower value and tau, x2/y2 the upper value and tau
    x1 := (SELECT _tmp_lkup.val FROM _tmp_lkup WHERE _tmp_lkup.val <= _inval ORDER BY _tmp_lkup.val DESC LIMIT 1);
    x2 := (SELECT _tmp_lkup.val FROM _tmp_lkup WHERE _tmp_lkup.val >= _inval ORDER BY _tmp_lkup.val ASC LIMIT 1);
    y1 := (SELECT _tmp_lkup.tau FROM _tmp_lkup WHERE _tmp_lkup.val <= _inval ORDER BY _tmp_lkup.val DESC LIMIT 1);
    y2 := (SELECT _tmp_lkup.tau FROM _tmp_lkup WHERE _tmp_lkup.val >= _inval ORDER BY _tmp_lkup.val ASC LIMIT 1);
    -- interpolate, or not, as needed
    y := (select CASE
              WHEN x1 = x2 THEN y1     -- if equal, then return the exact tau
              WHEN x1 IS NULL THEN y2  -- if the lower value is null, then return the lowest tau (.03)
              WHEN x2 IS NULL THEN y1  -- if the upper value is null, then return the highest tau (.97)
              ELSE (y1 + (_inval - x1) / (x2 - x1) * (y2 - y1)) -- otherwise interpolate linearly
          END);
    return y;
end;
$$ language plpgsql volatile;
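I believe a faster solution would reduce the number of times the lookup is created. For example, by using a plpgsql loop over sex, interpolating all the male points, then all the female points, and returning the union of the two result sets?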
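A set-based alternative to looping is to interpolate every input row in a single statement, so that no function is called per value. The sketch below assumes a hypothetical input table measurement(id, chart_label, sex, age, val), which is not part of the original setup. It also assumes tau increases with val within an age, and it needs LATERAL (PostgreSQL 9.3+) and FILTER (9.4+). I have not timed it:

```sql
-- hypothetical input table: measurement(id, chart_label, sex, age, val)
select m.id,
       case
           when b.x1 = b.x2 then b.y1   -- exact match: return that tau
           when b.x1 is null then b.y2  -- below the table: lowest tau
           when b.x2 is null then b.y1  -- above the table: highest tau
           else b.y1 + (m.val - b.x1) / (b.x2 - b.x1) * (b.y2 - b.y1)
       end as tau
from measurement m
cross join lateral (
    -- closest age for this measurement, ties going to the younger age
    select c.age
    from cdc_chart_value c
    where c.chart_label = m.chart_label
      and c.sex = m.sex
    order by abs(c.age - m.age), c.age
    limit 1
) a
cross join lateral (
    -- bracketing values and taus at that age
    select max(c.val) filter (where c.val <= m.val) as x1,
           max(c.tau) filter (where c.val <= m.val) as y1,
           min(c.val) filter (where c.val >= m.val) as x2,
           min(c.tau) filter (where c.val >= m.val) as y2
    from cdc_chart_value c
    where c.chart_label = m.chart_label
      and c.sex = m.sex
      and c.age = a.age
) b;
```

Measurements whose chart label and sex have no rows in cdc_chart_value drop out of the result, because the first lateral subquery returns nothing for them.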
Another possible solution might be to use the griddata interpolation from the python / scipy extension.