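I have a dataset of search queries for a range of different merchants. I have a Python script that first builds head, torso, and tail query sets from the main table in SQL, based on count(query) thresholds of 1000, 100, and so on.
Because the merchants my script runs on may or may not have queries that meet those thresholds, the script does not always produce "head.csv", "torso.csv", and "tail.csv". How can I split the queries into head, torso, and tail groups while still respecting the logic above?
I also tried splitting the groups by percentile (33, 33, 33), but that skews both the head and the torso when a merchant has a very long tail.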
Current:
-- head
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) >=1000
-- torso
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) <1000 and count(*) >=100
-- tail
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) <100
-- using ntile: note that 3 tiles put roughly 33.3% of the queries in each group, which introduces the skew
select trim(query), count(*) as query_count,
ntile(3) over(order by query_count desc) AS group_ntile
from my_merchant_table
group by trim(query)
order by query_count desc limit 100;
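The skew can be reproduced outside the database. The sketch below emulates the NTILE split (equal row counts per tile, larger tiles first) in plain Python. Only the first four counts come from the question; the remaining counts are hypothetical and stand in for a long tail.

```python
def ntile(row_count, buckets):
    """Return the 1-based tile number for each of row_count ordered rows.

    Mirrors SQL NTILE: tile sizes differ by at most one row and the
    larger tiles come first.
    """
    base, extra = divmod(row_count, buckets)
    tiles = []
    for tile in range(1, buckets + 1):
        size = base + (1 if tile <= extra else 0)
        tiles.extend([tile] * size)
    return tiles


# The first four counts are from the question; the rest are hypothetical.
counts = [1277, 495, 357, 246, 40, 12] + [3] * 6 + [1] * 18
counts.sort(reverse=True)
tiles = ntile(len(counts), 3)

# Collect the counts that land in each tile.
by_tile = {}
for count, tile in zip(counts, tiles):
    by_tile.setdefault(tile, []).append(count)

for tile, members in sorted(by_tile.items()):
    print(tile, len(members), max(members), min(members))
```

With this sample data, tile 1 holds ten queries whose counts range from 1277 down to 3, so a query seen three times is labelled "head" next to one seen 1277 times. That is the skew described above: NTILE balances the number of rows per group, not the traffic behind them.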
Ideally, the solution would build on this:
select trim(query), count(*) as query_count,
ntile(100) over(order by query_count desc) AS group_ntile
from my_merchant_table
-- other conditions & date range
group by trim(query)
order by query_count desc
This gives:
btrim  query_count  group_ntile
q0     1277         1
q1     495          1
q2     357          1
q3     246          1
... and so on until group_ntile = 100, with query_count decreasing.
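Question: what is the best way to structure this so that the whole logic is merchant-agnostic, with no hard-coded configuration?
Note: I am fetching the data in Redshift, so the solution should be compatible with Postgres 8.0 and, in particular, Redshift.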
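One possible merchant-agnostic direction, offered only as a sketch and not taken from the thread, is to cut on each query's cumulative share of the merchant's total volume instead of on absolute counts or equal row counts. The function name and the 50% / 90% cut points below are assumptions chosen for illustration; the tail counts in the sample are hypothetical.

```python
def bucket_by_cumulative_share(counts, head_share=0.5, torso_share=0.9):
    """Label each query 'head', 'torso', or 'tail'.

    counts maps query -> occurrence count. Queries are walked from most
    to least frequent; a query is labelled by the share of total volume
    already covered by the queries ranked before it.
    head_share and torso_share are illustrative assumptions.
    """
    total = sum(counts.values())
    labels = {}
    running = 0
    # Sort by count descending, then by query text for a stable order.
    for query, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        share_before = running / total if total else 0.0
        if share_before < head_share:
            labels[query] = "head"
        elif share_before < torso_share:
            labels[query] = "torso"
        else:
            labels[query] = "tail"
        running += count
    return labels


# q0..q3 are from the question; q4..q13 are a hypothetical tail.
sample = {"q0": 1277, "q1": 495, "q2": 357, "q3": 246}
sample.update({f"q{i}": 10 for i in range(4, 14)})
labels = bucket_by_cumulative_share(sample)
print(labels)
```

Because the cut points are relative to each merchant's own total, the most frequent query always lands in the head, whatever its absolute count. Whether 50% and 90% are sensible values is a judgment call that this sketch does not settle.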
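Answer 0 (score: 0)
I assume you are calling these queries from some programming language that then processes the results. My suggestion is to fetch all the records and apply the filters there. Bear in mind that if your query performs several data operations inside the database, your application's response time will suffer.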
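A minimal sketch of that suggestion, assuming the script has already fetched (query, count) rows from a single grouped query. The thresholds 1000 and 100 come from the question; the function names and the CSV column names are hypothetical. Every group is written out even when it is empty, so all three files always exist.

```python
import csv
import os
import tempfile


def split_by_threshold(rows, head_min=1000, torso_min=100):
    """Split (query, count) rows into head, torso, and tail lists."""
    groups = {"head": [], "torso": [], "tail": []}
    for query, count in rows:
        if count >= head_min:
            groups["head"].append((query, count))
        elif count >= torso_min:
            groups["torso"].append((query, count))
        else:
            groups["tail"].append((query, count))
    return groups


def write_groups(groups, out_dir):
    """Write one CSV per group, including empty groups; return the paths."""
    paths = []
    for name, members in groups.items():
        path = os.path.join(out_dir, f"{name}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["query", "query_count"])  # header is always written
            writer.writerows(members)
        paths.append(path)
    return paths


# Sample rows taken from the output shown in the question.
rows = [("q0", 1277), ("q1", 495), ("q2", 357), ("q3", 246)]
groups = split_by_threshold(rows)
out_dir = tempfile.mkdtemp()
written = write_groups(groups, out_dir)
print(written)
```

For this sample, the tail group is empty but tail.csv is still created with only its header row, so downstream steps can rely on the file being present.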
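Answer 1 (score: 0)
Assuming the main challenge is building "tiles" from a list of values, here is some sample code. It takes the 13 provinces and territories of Canada and splits them into the requested number of groups. It uses province names, but numbers would work just as well.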
SELECT * FROM Provinces ORDER BY province; -- To see what we are working with
+---------------------------+
| province |
+---------------------------+
| Alberta |
| British Columbia |
| Manitoba |
| New Brunswick |
| Newfoundland and Labrador |
| Northwest Territories |
| Nova Scotia |
| Nunavut |
| Ontario |
| Prince Edward Island |
| Quebec |
| Saskatchewan |
| Yukon |
+---------------------------+
13 rows in set (0.00 sec)
Now the code:
SELECT @n := COUNT(*), -- Find total count (13)
@j := 0.5, -- 'trust me'
@tiles := 3 -- The number of groupings
FROM Provinces;
SELECT group_start
FROM (
SELECT
IF((@j * @tiles) % @n < @tiles, province, NULL) AS group_start,
@j := @j + 1
FROM Provinces
ORDER BY province
) x
WHERE group_start IS NOT NULL;
+---------------------------+
| group_start |
+---------------------------+
| Alberta |
| Newfoundland and Labrador |
| Prince Edward Island |
+---------------------------+
3 rows in set (0.00 sec)
With @tiles set to 4:
+---------------+
| group_start |
+---------------+
| Alberta |
| New Brunswick |
| Nova Scotia |
| Quebec |
+---------------+
4 rows in set (0.00 sec)
Reasonably efficient: one pass to count the rows, one pass to do the calculation, and one pass to filter out the non-break values.
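The SQL above relies on MySQL user variables (`@j := ...`), which, as far as I know, are not available in Redshift. The arithmetic itself is portable, so here is a sketch of the same boundary test in Python. The function name `group_starts` is hypothetical; the 0.5 starting offset and the modulo test are copied from the SQL.

```python
def group_starts(values, tiles):
    """Return the first value of each tile, using the answer's modulo test.

    Row i (0-based, in sorted order) starts a new tile when
    ((i + 0.5) * tiles) % n < tiles, where n is the number of rows.
    """
    ordered = sorted(values)
    n = len(ordered)
    starts = []
    j = 0.5  # same starting offset as @j in the SQL
    for value in ordered:
        if (j * tiles) % n < tiles:
            starts.append(value)
        j += 1
    return starts


PROVINCES = [
    "Alberta", "British Columbia", "Manitoba", "New Brunswick",
    "Newfoundland and Labrador", "Northwest Territories", "Nova Scotia",
    "Nunavut", "Ontario", "Prince Edward Island", "Quebec",
    "Saskatchewan", "Yukon",
]

print(group_starts(PROVINCES, 3))
print(group_starts(PROVINCES, 4))
```

For 3 and 4 tiles this yields the same group starts as the two result tables above. For the original problem, the input would be the per-query rows ordered by count rather than province names.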