How can I use PL/pgSQL to convert an original table into the two desired tables in PostgreSQL?

Asked: 2015-01-27 06:25:19

Tags: postgresql

Suppose we have a table things_happened whose schema looks like this:

CREATE TABLE things_happened
(
  zipcode character varying(10),
  city character varying(50),
  state character varying(2),
  metro character varying(50),
  countyname character varying(50),
  "1996-04" integer,
  "1996-05" integer,
  "1996-06" integer,
  "1996-07" integer,
  "1996-08" integer,
  "1996-09" integer,
  ...
  "2014-09" integer,
  "2014-10" integer,
  "2014-11" integer
)

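It looks odd because the data was imported from a CSV file by someone else.

Clearly this table is inefficient: for any given area, many of the monthly values are null. So I am thinking of creating two tables instead.

The schemas of the two desired tables are: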

area_info (zipcode, city, state, metro, countyname) with zipcode as primary key
things_happened_per_month (year, month, zipcode, times) with year, month, zipcode as primary key
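
As a minimal sketch, the two target tables could be declared as follows. The column types are assumptions carried over from things_happened (the question only lists column names), and the CHECK constraint and foreign key are optional additions that are not part of the question.

```sql
-- Sketch only: column types are assumed to match things_happened.
CREATE TABLE area_info
(
  zipcode    character varying(10) PRIMARY KEY,
  city       character varying(50),
  state      character varying(2),
  metro      character varying(50),
  countyname character varying(50)
);

CREATE TABLE things_happened_per_month
(
  year    integer NOT NULL,
  month   integer NOT NULL CHECK (month BETWEEN 1 AND 12),  -- optional sanity check
  -- Optional foreign key: it requires area_info to be populated first.
  zipcode character varying(10) NOT NULL REFERENCES area_info (zipcode),
  times   integer NOT NULL,
  PRIMARY KEY (year, month, zipcode)
);
```

With this layout, a month without data for an area is represented by a missing row rather than by a null column.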

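Because the table is large, data keeps arriving, and the column names have to be turned into parameters, I would like to know how to do this with "PL/pgSQL - SQL Procedural Language". Or is there any other efficient solution?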

1 answer:

Answer 0: (score: 1)

Your table things_happened looks like a pivot table, which you want to normalize into a more efficient data structure. You will have to write a PL/pgSQL function to do that.

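Since you have many months of data, and more month columns will presumably be added later, I suggest that you determine the month columns in the table dynamically and then loop over the result. In the examples below I assume that you have already copied the area_info data into its own table; I focus on the "times" column in the thpm table (things_happened_per_month, which I assume you have already created).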
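
For completeness, the area_info copy that is assumed here could look roughly like the following. This is only a sketch: it assumes area_info already exists with the five columns named in the question, and that each zipcode carries a single set of area attributes. If it does not, DISTINCT ON keeps one arbitrary row per zipcode.

```sql
-- Sketch: copy one row of area attributes per zipcode.
INSERT INTO area_info (zipcode, city, state, metro, countyname)
SELECT DISTINCT ON (zipcode) zipcode, city, state, metro, countyname
FROM things_happened
WHERE zipcode IS NOT NULL      -- a primary key cannot be null
ORDER BY zipcode;
```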

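Dynamic solution

The function below dynamically looks up the YYYY-MM columns in the table, then loops over the records and the columns to import the data into the normalized table. (Many thanks to Pavel Stehule for pointing out the last remaining bug in the code.)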

CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
  col_names text[];
  period    text;
  th        things_happened%rowtype;
  times     integer;
BEGIN
  -- Get the currently present columns from the catalog
  SELECT array_agg(attname::text) INTO col_names
  FROM pg_attribute att
  JOIN pg_class c ON c.oid = att.attrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relname = 'things_happened'
    AND n.nspname = 'public'
    AND position('-' in attname) = 5; -- only "times" columns

  -- Loop over all the rows in the things_happened table
  FOR th IN SELECT * FROM things_happened LOOP
    -- Now loop over column names
    FOREACH period IN ARRAY col_names LOOP
      -- Fudge the proper column from the th record into a local variable
      EXECUTE 'SELECT $1.' || quote_ident(period) INTO times USING th;

      -- If times is a proper value, insert it into the thpm table
      IF times IS NOT NULL THEN
        INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES
          (substring(period from 1 for 4)::int, substring(period from 6 for 2)::int, th.zipcode, times);
      END IF;
    END LOOP;
  END LOOP;
END; $$ LANGUAGE plpgsql;
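
Once the function exists, running it and spot-checking the result might look like this. The verification query is only an illustration and not part of the original answer. The last statement shows how the substring calls in the function split a column name into year and month.

```sql
-- Run the conversion; the function returns void.
SELECT normalize_things_happened();

-- Spot-check: number of rows produced per year.
SELECT year, count(*) AS row_count
FROM things_happened_per_month
GROUP BY year
ORDER BY year;

-- How a column name such as "1996-04" is split: year = 1996, month = 4.
SELECT substring('1996-04' from 1 for 4)::int AS year,
       substring('1996-04' from 6 for 2)::int AS month;
```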

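Ideally this should be a one-time exercise. If the original table keeps receiving new data and you have to run this function periodically, then you should do an UPSERT in the innermost loop, after the EXECUTE: first try to UPDATE the "times" value, and if that affects no row because there is no data yet for the year, month, zipcode combination, do an INSERT instead. See the many other questions here for examples.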
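
The update-then-insert pattern described above could be sketched as a small helper function. The name upsert_thpm and its parameter names are hypothetical, and the sketch is not safe against concurrent writers without additional locking or retry logic. On PostgreSQL 9.5 or later, INSERT ... ON CONFLICT ... DO UPDATE is a built-in alternative.

```sql
-- Hypothetical helper: update the row if it exists, otherwise insert it.
CREATE FUNCTION upsert_thpm(p_year int, p_month int, p_zipcode varchar, p_times int)
RETURNS void AS $$
BEGIN
  -- First try to update an existing (year, month, zipcode) row
  UPDATE things_happened_per_month
  SET times = p_times
  WHERE year = p_year AND month = p_month AND zipcode = p_zipcode;

  -- FOUND is false when the UPDATE matched no row, so insert instead
  IF NOT FOUND THEN
    INSERT INTO things_happened_per_month (year, month, zipcode, times)
    VALUES (p_year, p_month, p_zipcode, p_times);
  END IF;
END; $$ LANGUAGE plpgsql;
```

Inside the dynamic function, the INSERT in the innermost loop would then be replaced by a call such as PERFORM upsert_thpm(substring(period from 1 for 4)::int, substring(period from 6 for 2)::int, th.zipcode, times);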

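Static solution

The following function is a non-dynamic variant. You have to add a separate block of code for every month column in the things_happened table.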

CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
  th        things_happened%rowtype;
  times     integer;
BEGIN
  -- Loop over all the rows in the things_happened table
  FOR th IN SELECT * FROM things_happened LOOP
    -- Copy the below block for 1996, April, for all other months.
    SELECT th."1996-04" INTO times;
    IF times IS NOT NULL THEN
      INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 4, th.zipcode, times);
    END IF;
    -- 1996, May
    SELECT th."1996-05" INTO times;
    IF times IS NOT NULL THEN
      INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 5, th.zipcode, times);
    END IF;
    -- Etc.
  END LOOP;
END; $$ LANGUAGE plpgsql;

Ugly, but functional.