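I'm working with a very large dataset and have run into a problem that I'm not sure the current approach can solve. I should say up front that I did not come up with the original approach, but we have been tasked with taking it over. Rewriting the logic at this point would be a very significant undertaking.
The project runs reports on a data warehouse, but to keep things simple I created an example that illustrates the problem I'm running into.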
CREATE TEMPORARY TABLE test_customers2 (
id integer PRIMARY KEY,
first_name varchar(40) NOT NULL,
last_name varchar(40) NOT NULL,
newsletter integer NOT NULL,
vipmember integer NOT NULL
);
INSERT INTO test_customers2 VALUES(1, 'Reed', 'Richards', 1, 1);
INSERT INTO test_customers2 VALUES(2, 'Johnny', 'Storm', 0, 1);
INSERT INTO test_customers2 VALUES(3, 'Peter', 'Parker', 1, 0);
CREATE TEMPORARY TABLE test_purchases (
id integer CONSTRAINT firstkey2 PRIMARY KEY,
cid integer NOT NULL
);
INSERT INTO test_purchases VALUES(1, 1);
INSERT INTO test_purchases VALUES(2, 2);
INSERT INTO test_purchases VALUES(3, 2);
INSERT INTO test_purchases VALUES(4, 3);
SELECT
COUNT(distinct c.id) as "Total Customers"
,COUNT(distinct p.id) as "Total Sales"
,COUNT(distinct p.id)::decimal/COUNT(distinct c.id)::decimal as "Sales per customer"
,SUM(c.newsletter) as "Subscribed"
,SUM(c.newsletter)::decimal/COUNT(c.newsletter)::decimal as "Pct Subscribed"
,SUM(c.vipmember) as "VIP"
,SUM(c.vipmember)::decimal/COUNT(c.vipmember)::decimal as "Pct VIP"
FROM test_customers2 c
INNER JOIN test_purchases p ON c.id = p.cid
When you run the SELECT at the end, you get this result:
3 | 4 | 1.33... | 2 | 0.50... | 3 | 0.75...
The problem is that the join is throwing off my results, because what I'm really looking for is this:
3 | 4 | 1.33... | 2 | 0.66... | 2 | 0.66...
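DISTINCT helps with unique values, but it does not work for the boolean flags (in this case literally int columns, not declared as boolean), because their only possible values are 1, 0, or null. I think I may need to move this into a subquery, but apart from the performance hit, rewriting a large amount of code would be rather unwelcome. Is there a better way that I may be missing?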
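Answer 0: (score: 2)
The problem is that you are performing the join only to get columns from separate tables into one row set. You are not actually using the relationship between the two source tables, nor do you want to. What you want to relate to each other are aspects of the aggregated data, not the rows you are joining.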
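To see the effect, it can help to look at the rows the join produces before any aggregation. The query below runs against the sample tables from the question; the column aliases cid and pid are only there for readability:

```sql
-- List the joined rows that the aggregates in the question operate on.
SELECT c.id AS cid, c.first_name, c.newsletter, c.vipmember, p.id AS pid
FROM test_customers2 c
INNER JOIN test_purchases p ON c.id = p.cid
ORDER BY c.id, p.id;
-- cid | first_name | newsletter | vipmember | pid
--   1 | Reed       |          1 |         1 |   1
--   2 | Johnny     |          0 |         1 |   2
--   2 | Johnny     |          0 |         1 |   3
--   3 | Peter      |          1 |         0 |   4
```

Johnny appears twice because he has two purchases, so his vipmember flag is summed twice (giving 3 instead of 2), and both percentages are divided by 4 joined rows instead of 3 customers.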
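I suggest computing the single-table statistics in separate inline views / CTEs, and then (cross) joining the two one-row results to get another single row from which to perform the final select. Something like this, for example: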
SELECT
c.c_count as "Total Customers"
,p.p_count as "Total Sales"
,p.p_count::decimal/c.c_count::decimal as "Sales per customer"
,c.nl_sum as "Subscribed"
,c.nl_sum::decimal/c.c_count::decimal as "Pct Subscribed"
,c.vip_sum as "VIP"
,c.vip_sum::decimal/c.c_count::decimal as "Pct VIP"
FROM
(
SELECT
count(*) as c_count,
sum(newsletter) as nl_sum,
sum(vipmember) as vip_sum
FROM test_customers2
) c
CROSS JOIN
(
SELECT COUNT(*) AS p_count FROM test_purchases
) p
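The same idea can also be written with CTEs instead of inline views. This is only a sketch, assuming PostgreSQL and the sample tables from the question; the CTE names customer_stats and purchase_stats are made up for illustration:

```sql
-- Each CTE aggregates one table on its own, so no row is counted twice.
WITH customer_stats AS (
    SELECT
        count(*)        AS c_count,
        sum(newsletter) AS nl_sum,
        sum(vipmember)  AS vip_sum
    FROM test_customers2
),
purchase_stats AS (
    SELECT count(*) AS p_count
    FROM test_purchases
)
-- Both CTEs return exactly one row, so the cross join also returns one row.
SELECT
    c.c_count                               AS "Total Customers",
    p.p_count                               AS "Total Sales",
    p.p_count::decimal / c.c_count::decimal AS "Sales per customer",
    c.nl_sum                                AS "Subscribed",
    c.nl_sum::decimal / c.c_count::decimal  AS "Pct Subscribed",
    c.vip_sum                               AS "VIP",
    c.vip_sum::decimal / c.c_count::decimal AS "Pct VIP"
FROM customer_stats c
CROSS JOIN purchase_stats p;
```

With the sample data this should give the row the question asks for: 3 | 4 | 1.33... | 2 | 0.66... | 2 | 0.66... If the customers table can be empty, the divisions would need a guard such as NULLIF(c.c_count, 0).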
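Answer 1: (score: 0)
You don't actually need a join. None of your logic requires matching the two tables. Here is the query in MSSQL (sorry, I don't know Postgres), but I think you can translate it.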
SELECT COUNT(*) as "Total Customers",
(SELECT COUNT(*) FROM test_purchases) as "Total Sales",
CAST((SELECT COUNT(*) FROM test_purchases) AS DECIMAL) / COUNT(*) as "Sales per Customer",
SUM(c.newsletter) as "Subscribed",
CAST(SUM(c.newsletter) AS DECIMAL) / COUNT(*) as "Pct Subscribed",
SUM(c.vipmember) as "VIP",
CAST(SUM(c.vipmember) AS DECIMAL) / COUNT(*) as "Pct VIP"
FROM test_customers2 c
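For PostgreSQL, the same query can be written with the ::decimal cast that the question already uses. This is a sketch against the sample tables, not a tested port; the standard CAST(... AS DECIMAL) form should be accepted by PostgreSQL as well:

```sql
-- Aggregate the customers table directly; the purchase count comes from
-- an uncorrelated scalar subquery, so no join is involved.
SELECT
    COUNT(*)                                                  AS "Total Customers",
    (SELECT COUNT(*) FROM test_purchases)                     AS "Total Sales",
    (SELECT COUNT(*) FROM test_purchases)::decimal / COUNT(*) AS "Sales per customer",
    SUM(c.newsletter)                                         AS "Subscribed",
    SUM(c.newsletter)::decimal / COUNT(*)                     AS "Pct Subscribed",
    SUM(c.vipmember)                                          AS "VIP",
    SUM(c.vipmember)::decimal / COUNT(*)                      AS "Pct VIP"
FROM test_customers2 c;
```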
Answer 2: (score: 0)
Possibly more "flexible":
SELECT
COUNT(c.id) as "Total Customers"
,SUM(p.total_sales) as "Total Sales"
,SUM(p.total_sales)::decimal/COUNT(c.id)::decimal as "Sales per customer"
,SUM(c.newsletter) as "Subscribed"
,SUM(c.newsletter)::decimal/COUNT(c.newsletter)::decimal as "Pct Subscribed"
,SUM(c.vipmember) as "VIP"
,SUM(c.vipmember)::decimal/COUNT(c.vipmember)::decimal as "Pct VIP"
FROM test_customers2 c
JOIN (select cid, count(*) as total_sales from test_purchases group by cid) p ON c.id = p.cid
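One thing to watch with this form: the inner join keeps only customers that have at least one purchase. That is fine for the sample data, where every customer has one. If customers without purchases should still be counted (an assumption, since the question does not say), a LEFT JOIN is one way to keep them, sketched here:

```sql
-- Pre-aggregate purchases to one row per customer, then keep every customer.
SELECT
    COUNT(c.id)                                            AS "Total Customers",
    SUM(COALESCE(p.total_sales, 0))                        AS "Total Sales",
    SUM(COALESCE(p.total_sales, 0))::decimal / COUNT(c.id) AS "Sales per customer",
    SUM(c.newsletter)                                      AS "Subscribed",
    SUM(c.newsletter)::decimal / COUNT(c.newsletter)       AS "Pct Subscribed",
    SUM(c.vipmember)                                       AS "VIP",
    SUM(c.vipmember)::decimal / COUNT(c.vipmember)         AS "Pct VIP"
FROM test_customers2 c
LEFT JOIN (
    SELECT cid, count(*) AS total_sales
    FROM test_purchases
    GROUP BY cid
) p ON c.id = p.cid;
```

Because the subquery returns at most one row per customer, the join cannot duplicate customer rows, which is what keeps the newsletter and VIP sums correct.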