Suppose we have a table things_happened whose schema looks like this:
CREATE TABLE things_happened
(
    zipcode character varying(10),
    city character varying(50),
    state character varying(2),
    metro character varying(50),
    countyname character varying(50),
    "1996-04" integer,
    "1996-05" integer,
    "1996-06" integer,
    "1996-07" integer,
    "1996-08" integer,
    "1996-09" integer,
    ...
    "2014-09" integer,
    "2014-10" integer,
    "2014-11" integer
)
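It looks odd because the data was imported from a CSV file by someone else.
Clearly this table is inefficient: for any given area, the values for many of the months are NULL. So I am thinking of splitting it into two tables.
The schemas of the two desired tables are: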
area_info (zipcode, city, state, metro, countyname) with zipcode as primary key
things_happened_per_month (year, month, zipcode, times) with year, month, zipcode as primary key
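As a rough sketch, the two target tables might be declared like this. The varchar lengths are copied from things_happened; the integer types for year and month, the CHECK constraint, and the foreign key are assumptions added for illustration and are not part of the schema described above.

```sql
-- One row per area; column types copied from things_happened.
CREATE TABLE area_info (
    zipcode    character varying(10) PRIMARY KEY,
    city       character varying(50),
    state      character varying(2),
    metro      character varying(50),
    countyname character varying(50)
);

-- One row per (year, month, zipcode) that actually has a value.
CREATE TABLE things_happened_per_month (
    year    integer NOT NULL,
    month   integer NOT NULL CHECK (month BETWEEN 1 AND 12),  -- assumed constraint
    zipcode character varying(10) NOT NULL
            REFERENCES area_info (zipcode),                   -- assumed foreign key
    times   integer,
    PRIMARY KEY (year, month, zipcode)
);
```

With the foreign key in place, area_info has to be populated before things_happened_per_month.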
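Because the table is large, new data keeps arriving, and the column names have to be turned into parameters, I would like to know how to do this with "PL/pgSQL - SQL Procedural Language". Or is there any other efficient solution?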
Answer 0 (score: 1)
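Your table things_happened looks like a pivot table that you want to normalize into a more efficient data structure. You will have to write a PL/pgSQL function to do that.
Since you have many month columns, and more may be added later, I suggest determining the month columns in the table dynamically and then looping over the result. In the example below I assume that you have already copied the area_info data to its own table; I focus on the "times" column in the things_happened_per_month (thpm) table, which I assume you have already created.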
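The area_info copy that is assumed here could be sketched as follows. It assumes that a zipcode identifies exactly one area; if the same zipcode appears with differing city or county values, DISTINCT ON keeps an arbitrary one of them.

```sql
-- Copy the descriptive columns once per zipcode into area_info.
INSERT INTO area_info (zipcode, city, state, metro, countyname)
SELECT DISTINCT ON (zipcode)
       zipcode, city, state, metro, countyname
FROM things_happened
WHERE zipcode IS NOT NULL   -- a primary key cannot hold NULL
ORDER BY zipcode;
```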
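The function below looks up the YYYY-MM columns of the table dynamically and then loops over the records and the columns to import the data into the normalized table. (Many thanks to Pavel Stehule for pointing out the last remaining bug in the code.)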
CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
    col_names text[];
    period text;
    th things_happened%rowtype;
    times integer;
BEGIN
    -- Get the currently present columns from the catalog
    SELECT array_agg(attname::text) INTO col_names
    FROM pg_attribute att
    JOIN pg_class c ON c.oid = att.attrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relname = 'things_happened'
      AND n.nspname = 'public'
      AND position('-' in attname) = 5; -- only "times" columns

    -- Loop over all the rows in the things_happened table
    FOR th IN SELECT * FROM things_happened LOOP
        -- Now loop over column names
        FOREACH period IN ARRAY col_names LOOP
            -- Fudge the proper column from the th record into a local variable
            EXECUTE 'SELECT $1.' || quote_ident(period) INTO times USING th;
            -- If times is a proper value, insert it into the thpm table
            IF times IS NOT NULL THEN
                INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES
                    (substring(period from 1 for 4)::int, substring(period from 6 for 2)::int, th.zipcode, times);
            END IF;
        END LOOP;
    END LOOP;
END; $$ LANGUAGE plpgsql;
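A possible way to run the function and spot-check the result is sketched below. Wrapping the call in a transaction is a suggestion rather than a requirement; it lets you ROLLBACK instead of COMMIT if the summary looks wrong.

```sql
BEGIN;

-- Run the one-off normalization.
SELECT normalize_things_happened();

-- Spot-check: number of zipcodes and total count per month.
SELECT year, month, count(*) AS zipcodes, sum(times) AS total_times
FROM things_happened_per_month
GROUP BY year, month
ORDER BY year, month;

COMMIT;
```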
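Ideally this is a one-off exercise. If the original table keeps receiving new data, you should run this function periodically and do an UPSERT in the innermost block in place of the plain INSERT: first try to UPDATE the "times" value and, if that fails because there is no row for that year, month and zipcode combination, do an INSERT. See the many other questions here for examples.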
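The UPDATE-then-INSERT pattern described above might be sketched as a small helper. The function name and its parameters are hypothetical and only serve as illustration; the innermost block of the loop would call it instead of issuing the INSERT directly.

```sql
-- Hypothetical helper: name and parameters are illustrative only.
CREATE FUNCTION upsert_things_happened_per_month(
    p_year integer, p_month integer, p_zipcode character varying, p_times integer
) RETURNS void AS $$
BEGIN
    -- First try to update an existing row for this key.
    UPDATE things_happened_per_month
       SET times = p_times
     WHERE year = p_year
       AND month = p_month
       AND zipcode = p_zipcode;

    -- FOUND is false when the UPDATE matched no row, so insert instead.
    IF NOT FOUND THEN
        INSERT INTO things_happened_per_month (year, month, zipcode, times)
        VALUES (p_year, p_month, p_zipcode, p_times);
    END IF;
END; $$ LANGUAGE plpgsql;
```

This two-step form is not safe against concurrent writers inserting the same key. On PostgreSQL 9.5 or later, a single INSERT ... ON CONFLICT (year, month, zipcode) DO UPDATE SET times = EXCLUDED.times achieves the same effect atomically.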
The following function is a non-dynamic variant. You have to add a separate block of code for every month column in the things_happened table.
CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
    th things_happened%rowtype;
    times integer;
BEGIN
    -- Loop over all the rows in the things_happened table
    FOR th IN SELECT * FROM things_happened LOOP
        -- Copy the block below (1996, April) for every other month.
        SELECT th."1996-04" INTO times;
        IF times IS NOT NULL THEN
            INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 4, th.zipcode, times);
        END IF;
        -- 1996, May
        SELECT th."1996-05" INTO times;
        IF times IS NOT NULL THEN
            INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 5, th.zipcode, times);
        END IF;
        -- Etc.
    END LOOP;
END; $$ LANGUAGE plpgsql;
Ugly, but functional.
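The question also asks for other efficient solutions. One set-based alternative, which is not part of the answer above and assumes PostgreSQL 9.5 or later (for to_jsonb), is to turn each row into a jsonb object and unpivot its keys in a single INSERT ... SELECT. A minimal sketch, assuming every month column is named exactly YYYY-MM:

```sql
-- Unpivot all YYYY-MM columns in one statement, without a row-by-row loop.
INSERT INTO things_happened_per_month (year, month, zipcode, times)
SELECT substring(kv.key from 1 for 4)::int AS year,
       substring(kv.key from 6 for 2)::int AS month,
       th.zipcode,
       kv.value::int                       AS times
FROM things_happened AS th
CROSS JOIN LATERAL jsonb_each_text(to_jsonb(th)) AS kv(key, value)
WHERE kv.key ~ '^[0-9]{4}-[0-9]{2}$'  -- keep only the month columns
  AND kv.value IS NOT NULL;           -- skip months without data
```

Because the column names are read from each row at run time, month columns added later are picked up without changing the statement.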