Question

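I am trying to produce a table containing counts of the non-null data points for the columns in the "Area Health Resource File" (AHRF), which contains per-county demographic and health data.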

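I have reworked the data from the format it is supplied in into time series, stored in a bunch of tables named "series_foo" for some data category foo, with rows identified by county FIPS code and year (initial and final, for multi-year surveys).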

Now I want to produce counts over the time-series columns. So far, my query is:

do language plpgsql $$
declare
  query text;
begin
  query := (with cats as (
    select tcategory, format('series_%s', tcategory) series_tbl
    from series_categories),
  cols as (
    select tcategory, series_tbl, attname col
    from pg_attribute a join pg_class r on a.attrelid = r.oid
      join cats c on c.series_tbl = r.relname
    where attname not in ('FIPS', 'initial', 'final')
    and attnum >= 0
    order by tcategory, col),
  scols as (
    select tcategory, series_tbl, col,
    format('count(%s)', quote_ident(col)) sel
    from cols),
  sel as (
    select format(
      E'  (select %s tcategory, %s col, %s from %s)\n', 
        quote_literal(tcategory), quote_literal(col), sel, series_tbl) q
   from scols)
  select string_agg(q, E'union\n') from sel);
  execute format(
    'select * into category_column_counts from (%s) x', query);
end;
$$;

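(Here the "series_categories" table holds the category names.) This ... "works", but it is probably hundreds of times slower than it needs to be. It makes ~10,000 individual table accesses, one per column, which could be cut by a factor of about 500, since there are only 20-ish categories and therefore only 20-ish tables. I would like to use select count(col1), count(col2) ... once for each table, then "unnest" the resulting row records and concatenate them together.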

I haven't figured it out yet. I have been looking at https://stackoverflow.com/a/14087244/435563 for inspiration, but have not yet managed to adapt it.
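One possible way to do the "count once per table, then unnest" step described above is a lateral VALUES list (PostgreSQL 9.3 or later). This is only a minimal sketch: the table series_demo and its columns col1 and col2 are hypothetical stand-ins, not names from the AHRF data.

```sql
-- Hypothetical stand-in for one of the series_ tables (made-up rows).
create temp table series_demo (
  fips text, yr int,
  col1 numeric, col2 numeric);
insert into series_demo values
  ('00001', 2010, 1.5, null),
  ('00002', 2010, 2.5, 7);

-- One scan of the table computes every count;
-- the lateral VALUES list then turns the single row into one row per column.
select 'demo' as tcategory, u.col, u.n
from (select count(col1) as c1, count(col2) as c2
      from series_demo) s
cross join lateral (values ('col1', s.c1),
                           ('col2', s.c2)) as u(col, n)
order by u.col;
-- col1 has 2 non-null values, col2 has 1
```

The VALUES list would still have to be generated from pg_attribute for the real tables, but each table is then read only once.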

Answer 1

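I don't know the AHRF format (I looked at the website, but there were too many cute nurse pictures for me to focus on the content...), but you probably went wrong at the very start by extracting the data into multiple tables and then trying to piece it back together again. Instead, you should use a design pattern called Entity-Attribute-Value, which stores all data values in a single table together with a category identifier and a "feature" identifier. The table structure looks somewhat like this: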

CREATE TABLE categories (
  id          serial PRIMARY KEY,
  category    text NOT NULL,
  ... -- other attributes like min/max allowable values, measurement technique, etc.
);

CREATE TABLE feature ( -- town, county, state, whatever
  id          serial PRIMARY KEY,
  fips        varchar NOT NULL,
  name        varchar,
  ... -- other attributes
);

CREATE TABLE measurement (
  feature     integer REFERENCES feature,
  category    integer REFERENCES categories,
  dt          date,
  value       double precision NOT NULL,
  PRIMARY KEY (feature, category, dt)
);

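This design pattern is very flexible. For example, you can store 50 categories for some rows of one feature class and only 5 categories for another. You can store data from multiple observations made on different dates or in different years. You can have multiple "feature" tables with separate "measurement" tables, or you can set this up using table inheritance.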

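Answering your query is then very simple using standard PK-FK relationships. More importantly, answering any query is far easier than with your current divide-but-not-conquer structure.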

I don't know exactly how your "initial year" / "final year" data works, but otherwise your requirement would be met by a simple query like this:

SELECT f.fips, c.category, count(*) 
FROM feature f -- replace feature by whatever real table you create, like "county"
JOIN measurement m ON m.feature = f.id
JOIN categories c ON c.id = m.category
GROUP BY f.fips, c.category;
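To get one count per category instead of one per county and category, which is closer to what the question asks for, the same data can be grouped by category alone. The following is a sketch only: the temp tables mirror the structure above without the elided columns, and every row in them is made up for illustration.

```sql
-- Temp copies of the tables above, minus the elided "other attributes".
create temp table categories (
  id serial primary key, category text not null);
create temp table feature (
  id serial primary key, fips varchar not null, name varchar);
create temp table measurement (
  feature  integer references feature,
  category integer references categories,
  dt       date,
  value    double precision not null,
  primary key (feature, category, dt));

-- Made-up demo rows; the serial ids come out as 1 and 2.
insert into categories (category) values ('population'), ('dentists');
insert into feature (fips, name) values ('00001', 'demo A'), ('00002', 'demo B');
insert into measurement values
  (1, 1, '2010-01-01', 100),
  (1, 2, '2010-01-01', 3),
  (2, 1, '2010-01-01', 200);

-- value is NOT NULL, so count(*) is already the number of non-null data points.
select c.category, count(*) as n
from measurement m
join categories c on c.id = m.category
group by c.category
order by c.category;
-- dentists: 1, population: 2
```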

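Do you want to know tooth decay as a function of smoking, drinking versus psychiatric medication, the relationship between obesity and substance abuse, or trends in infant development? All of that is fairly easy with the structure above, and all of it is a slow, painful slog with multiple tables.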

Answer 2

Here is the optimization I found: it uses json_each(row_to_json(c)) to turn a record into a sequence of individual values.
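Stripped of the dynamic SQL, the core trick looks roughly like this for a single table. This is a minimal sketch, assuming a hypothetical table series_demo_json with made-up columns col1 and col2; the full version below additionally wraps each count in row(...) so that the type name travels along with it.

```sql
-- Hypothetical stand-in for one of the series_ tables (made-up rows).
create temp table series_demo_json (
  fips text, yr int,
  col1 numeric, col2 numeric);
insert into series_demo_json values
  ('00001', 2010, 1.5, null),
  ('00002', 2010, 2.5, 7);

-- The inner select scans the table once and yields a single row of counts.
-- row_to_json turns that row into {"col1":2,"col2":1};
-- json_each then expands the object into one (key, value) row per column.
select d.key as column_name, d.value::text::int as n
from (select count(col1) as col1, count(col2) as col2
      from series_demo_json) c
cross join lateral json_each(row_to_json(c)) d
order by d.key;
-- col1 has 2 non-null values, col2 has 1
```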

do language plpgsql $$
declare
  query text;
begin
  query := (with cats as (
    select tcategory, table_name
    from series_category_tables),
  cols as (
    select tcategory, table_name, attname col, typname type_name
    from pg_attribute a join pg_class r on a.attrelid = r.oid
      join cats c on c.table_name = r.relname
      join pg_type t on t.oid = a.atttypid
    where attname not in ('FIPS', 'initial', 'final')
    and attnum >= 0
    order by tcategory, col),
  -- individual "count" fields
  sel as (
    select 
      format(
        E'  (select %s tcategory, %s table_name, \n'
        || E'    d.key column_name, d.value->>''f2'' type_name, '
        || E'(d.value->>''f1'')::int count\n'
        || E'  from (\n'
        || E'    select (json_each(row_to_json(c))).* from (select\n' 
        || E'      %s \n'
        || E'    from %s) c) d)\n',
      quote_literal(tcategory),
      quote_literal(table_name),
      string_agg(
        format(
          '      row(count(%1$s), %2$s) %1$s',
          quote_ident(col), quote_literal(type_name)), 
        E',\n'), quote_ident(table_name)) selstr
    from cols
    group by tcategory, table_name),
  selu as (
    select 
      string_agg(selstr, E'union\n') selu
    from sel)
  select * from selu);
  drop table if exists category_columns;
  create table category_columns (
    tcategory text, table_name text, 
    column_name text, type_name text, count int);
  execute format(
    'insert into category_columns select *  from (%s) x', query);

end;
$$;

It runs in about 45 seconds, versus 6 minutes for the previous version. Can I (or you) do better than this?

How to create a Postgres query to produce column counts for tables specified as data
