Question

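I have a set of data on search queries for different merchants. I have a Python script that first creates the head, torso and tail query sets from the main table in SQL, based on count(query) thresholds of 1000, 100, and so on.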

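Because a merchant the script runs on may or may not have queries that meet those thresholds, the script does not always generate "head.csv", "torso.csv" and "tail.csv". How can I split the queries into head, torso and tail groups while respecting the logic above?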

I also tried ntile to split the groups by percentile (33, 33, 33), but that skews both the head and the torso if the merchant has a very long tail.

Current:

-- head
select  trim(query) as query, count(*) 
from my_merchant_table
-- other conditions & date range 
GROUP BY trim(query)
having count(*) >=1000 

-- torso
select  trim(query) as query, count(*) 
from my_merchant_table
-- other conditions & date range 
GROUP BY trim(query)
having count(*) <1000  and count(*) >=100 

-- tail
select  trim(query) as query, count(*) 
from my_merchant_table
-- other conditions & date range 
GROUP BY trim(query)
having count(*) <100 

-- using ntile; note that ntile(3) yields three groups of about 33.3% each, which introduces the skew
select trim(query), count(*) as query_count,
       ntile(3) over(order by query_count desc) AS group_ntile
     from my_merchant_table
    group by trim(query)
     order by query_count desc  limit 100;

Ideally, the solution could build on something like this:

select trim(query), count(*) as query_count,
       ntile(100) over(order by query_count desc) AS group_ntile
     from my_merchant_table
      -- other conditions & date range 
    group by trim(query)
     order by query_count desc

This gives:

btrim   query_count   group_ntile
q0      1277          1
q1      495           1
q2      357           1
q3      246           1
... and so on until group_ntile = 100, while query_count decreases.

Question: what is the best way to structure this so that the whole logic is merchant-agnostic, with no hard-coded configuration?

Note: I am fetching the data from Redshift, so the solution should be compatible with Postgres 8.0 and, in particular, Redshift.

Answer 1

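I suppose you are calling the query from some programming language to process the information. My recommendation is to fetch all the records once and apply the filters over them in your code. Keep in mind that if you query the database several times for the same data manipulation, the response time of your application will suffer.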
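As a rough illustration of that idea, here is a minimal Python sketch. It assumes the aggregated (query, count) rows have already been fetched with a single GROUP BY query. The function names, the output directory and the default thresholds (1000 and 100, taken from the question) are assumptions for illustration, not part of any existing script.

```python
import csv
from pathlib import Path

# Assumed defaults: the 1000 / 100 cut-offs used in the question's queries.
HEAD_MIN = 1000
TORSO_MIN = 100


def split_head_torso_tail(rows, head_min=HEAD_MIN, torso_min=TORSO_MIN):
    """Bucket (query, count) pairs into head, torso and tail in one pass."""
    groups = {"head": [], "torso": [], "tail": []}
    for query, count in rows:
        if count >= head_min:
            groups["head"].append((query, count))
        elif count >= torso_min:
            groups["torso"].append((query, count))
        else:
            groups["tail"].append((query, count))
    return groups


def write_groups(groups, out_dir):
    """Write head.csv, torso.csv and tail.csv, even when a group is empty."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, members in groups.items():
        with open(out_dir / f"{name}.csv", "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["query", "query_count"])
            writer.writerows(members)


if __name__ == "__main__":
    # Sample rows copied from the output shown in the question.
    sample = [("q0", 1277), ("q1", 495), ("q2", 357), ("q3", 246)]
    for name, members in split_head_torso_tail(sample).items():
        print(name, len(members))
```

Because every group is always written, a merchant with no query above a threshold still gets all three files (some containing only the header row). The thresholds are plain parameters here, so this sketch does not by itself make them merchant-agnostic.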

Answer 2

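Assuming the main challenge is making "tiles" from a list of values, here is some sample code. It takes Canada's 13 provinces and territories and splits them into the requested number of groups. It uses the province names, but numbers would work just as well.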

SELECT * FROM Provinces ORDER BY province;  -- To see what we are working with
+---------------------------+
| province                  |
+---------------------------+
| Alberta                   |
| British Columbia          |
| Manitoba                  |
| New Brunswick             |
| Newfoundland and Labrador |
| Northwest Territories     |
| Nova Scotia               |
| Nunavut                   |
| Ontario                   |
| Prince Edward Island      |
| Quebec                    |
| Saskatchewan              |
| Yukon                     |
+---------------------------+
13 rows in set (0.00 sec)

Now the code (MySQL syntax, using user variables):

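SELECT @n := COUNT(*),   -- Total row count (13)
       @j := 0.5,        -- Running row position; the 0.5 offset shifts where the group boundaries round
       @tiles := 3       -- The number of groups
    FROM Provinces;

-- A row starts a new group when (@j * @tiles) % @n wraps around to below @tiles
SELECT group_start
    FROM (
        SELECT
            IF((@j * @tiles) % @n < @tiles, province, NULL) AS group_start,
            @j := @j + 1   -- Advance to the next row position
        FROM Provinces
        ORDER BY province
         ) x
    WHERE group_start IS NOT NULL;   -- Keep only the rows that start a group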

+---------------------------+
| group_start               |
+---------------------------+
| Alberta                   |
| Newfoundland and Labrador |
| Prince Edward Island      |
+---------------------------+
3 rows in set (0.00 sec)

With @tiles set to 4:

+---------------+
| group_start   |
+---------------+
| Alberta       |
| New Brunswick |
| Nova Scotia   |
| Quebec        |
+---------------+
4 rows in set (0.00 sec)

Reasonably efficient: one pass to count the rows, one pass to do the computation, and one pass to filter out the non-break values.
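For readers who cannot use MySQL user variables, the same boundary test can be sketched in Python. This is only a transcription of the arithmetic above under the same assumptions (values sorted ascending, row position starting at 0.5); the function name group_starts is made up for this example.

```python
def group_starts(values, tiles):
    """Return the first value of each group, mirroring the SQL above."""
    ordered = sorted(values)
    n = len(ordered)
    starts = []
    for index, value in enumerate(ordered):
        j = index + 0.5  # same starting offset as @j in the SQL
        # A row starts a group when (j * tiles) % n falls below tiles.
        if (j * tiles) % n < tiles:
            starts.append(value)
    return starts


PROVINCES = [
    "Alberta", "British Columbia", "Manitoba", "New Brunswick",
    "Newfoundland and Labrador", "Northwest Territories", "Nova Scotia",
    "Nunavut", "Ontario", "Prince Edward Island", "Quebec",
    "Saskatchewan", "Yukon",
]

if __name__ == "__main__":
    print(group_starts(PROVINCES, 3))
    # ['Alberta', 'Newfoundland and Labrador', 'Prince Edward Island']
    print(group_starts(PROVINCES, 4))
    # ['Alberta', 'New Brunswick', 'Nova Scotia', 'Quebec']
```

The printed group starts match the two result tables above. To apply this to the question, the sorted values would be the per-query counts instead of province names.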

Dynamically choosing thresholds from SQL
