Question

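I'm new to modern SQL, and I'm pretty sure I'm overcomplicating this. What I want to do is break values down into percentiles, both by the value itself and by its frequency. So, if I have 1,000 records containing 100 distinct numbers, there is a range of values and a range of frequencies with which those values occur. For each value, what I want to get is: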

The value itself.
Its percentile within the range of values.
Its count (frequency).
The percentile of that frequency.

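I'm experimenting with a toy table of 1,000 records, populated with random-ish values from www.mockaroo.com. My real tables are much larger (millions of rows). The purpose of all this is to tack the percentiles and so on onto the end of each row in a view, to feed a data visualization platform that isn't very good with percentiles. To be clear, if I start with 1K records in the table, the query should end up with 1K rows.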

CREATE TABLE IF NOT EXISTS mock (
    n integer
);

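Given how few rows are in my toy table, I'm using deciles rather than percentiles, but that shouldn't change the structure of the query. Here's how I get what I need from a single-column table: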

-- Get the value, count, and value percentile.
with
value_counts as (
select n as value,
    count(*) as frequency_count,
    ntile(10) over (order by n) as value_decile
  from mock
group by n
),

-- Now add the frequency percentile.
frequency_analysis as (
select value,
    ntile(10) over (order by frequency_count) as frequency_decile
    from value_counts
),

-- Don't need this CTE, just making things readable.
value_information as (
select value_counts.value,
    value_counts.value_decile,
    value_counts.frequency_count,
    frequency_analysis.frequency_decile
from   value_counts
join frequency_analysis on (frequency_analysis.value= value_counts.value)
)

select * from value_information;

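So, I think that works... but I don't need to examine just one column; I need to generate frequency counts, frequency percentiles, and value percentiles for many columns. It seems like this should be a common kind of statistical query, but I'm finding it hard to figure out how to do it in Postgres. A few notes before moving on to the two-column table: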

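I'm using CTEs for readability, to build aggregates that can be aggregated again in later CTEs, and to produce small results that can be joined. Everything is fast with 1K records, but it may not be so fast with 20 million.
I'm on Postgres 11 and won't move to PG 12 until some time after its release. So I'm not yet concerned with how CTEs are implemented there. The PG 11 behavior is exactly what I want.
I'm using ntile() because it makes it simple to adjust the number of bins dynamically, hence 10 rather than 100. I'm not sure whether I should be using width_bucket or percentile_cont / percentile_disc instead. I only discovered width_bucket this morning, so perhaps there is yet another built-in percentile method. (Apart from hand-coding it.)
If there's a better approach, I'm entirely happy to throw all of this away and do something else. That's why I'm starting out with toy data.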
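
As a rough sketch of how the candidates mentioned above differ, the query below runs them side by side on the toy table mock (column n). The output column names are invented for this illustration, and it assumes n is never NULL. ntile(10) produces bins with roughly equal row counts, width_bucket() produces bins of equal width across the value range, and cume_dist() returns the cumulative fraction of rows at or below each value. percentile_cont and percentile_disc go the other way (from a fraction to a value), so they are left out here.

```sql
-- Sketch only: compares built-in binning options on the toy table.
-- The output column names are hypothetical, chosen for this illustration.
select n,
       -- Equal-count bins: each decile holds roughly the same number of rows.
       ntile(10) over (order by n) as equal_count_decile,
       -- Equal-width bins over [min(n), max(n) + 1). The + 1 keeps max(n)
       -- in bin 10, because width_bucket treats the upper bound as exclusive.
       width_bucket(n::numeric,
                    (min(n) over ())::numeric,
                    (max(n) over ())::numeric + 1,
                    10) as equal_width_bucket,
       -- Fraction of rows with a value <= n, in the range (0, 1].
       cume_dist() over (order by n) as cumulative_fraction
  from mock
 order by n;
```

With long-tailed data the two binning styles can look very different: equal-width buckets may leave most rows in the first bucket, while ntile() spreads rows evenly regardless of the gaps between values.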

Okay, now an example of a more realistic table with more meaningful field names:

CREATE TABLE IF NOT EXISTS ascendco.mock2 (
    num_inst integer,
    points integer
);

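Again, 1,000 records with varying values in the two columns. The two columns have no relationship in the calculations; they just sit in the same row, and I want the aggregates added onto the end. This sounds a lot like a job for window functions, but I don't want to iterate over 10M+ rows of work. So, how do I do this for two or more columns? Most of my tables today have only 2-5 such columns, and that number will grow. What I'm running into is that GROUP BY gives one level of aggregation, and I need a different aggregation on each column. Is there another way besides the long-hand form I tried below?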

-- Get the num_inst count and the decile for the value.
with 
num_inst_distinct_counts as (
  select num_inst,
         count(*) as num_inst_frequency,
         ntile(10) over (order by num_inst) as value_decile
    from mock2 
group by num_inst),

-- Extend the previous CTE with the decile for the value's frequency. 
num_inst_information as (
    select *,
           ntile(10) over (order by num_inst_frequency) as frequency_decile
      from num_inst_distinct_counts
),

-- Get the points count and the decile for the value.
points_distinct_counts as (
  select points,
         count(*) as points_frequency,
         ntile(10) over (order by points) as value_decile
    from mock2 
group by points),

-- Extend the previous CTE with the decile for the value's frequency. 
points_information as (
    select *,
           ntile(10) over (order by points_frequency) as frequency_decile
      from points_distinct_counts
)

-- Put it all together. I could have used more general names in the CTEs, but this makes the output clearer.
select mock2.num_inst,
       num_inst_information.value_decile as num_inst_value_decile,
       num_inst_information.frequency_decile as num_inst_frequency_decile,

       mock2.points,
       points_information.value_decile as points_value_decile,
       points_information.frequency_decile as points_frequency_decile

from mock2
join num_inst_information on (num_inst_information.num_inst = mock2.num_inst)
join points_information   on (points_information.points = mock2.points)

order by 1;

That took me a long time to work out, and it's tedious. I'm guessing there's a neat way to get the job done using arrays and LATERAL joins.

Thanks for any help!

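Woke up and tried again. Okay, I didn't know this about things like count(*): you can nest aggregates, at least a bit. The revised queries below produce the same results as the originals, so that's some improvement. I still have a strong hunch that I'm missing something obvious that would make all of this much simpler. Any ideas? For the record, here are the new versions of the one-column and two-column queries. The long ORDER BY clause at the bottom of each statement is only there so that I can easily capture and diff the output of the original and revised queries.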

Here's the new one-column query, which is much shorter:

  select distinct n as value,
          ntile(10) over (order by n) as value_decile,
          count(*) as frequency_count,
          ntile(10) over (order by count(*)) as frequency_decile
    from mock
group by n
order by 1,2,3,4

The two-column version uses one CTE per column, then joins everything to the main table.

-- Get the details for the num_inst column.
with 
num_inst_information as (
  select distinct num_inst,
          ntile(10) over (order by num_inst) as value_decile,
          count(*) as frequency_count,
          ntile(10) over (order by count(num_inst)) as frequency_decile
    from mock2
group by num_inst
),

-- Get the details for the points column.
points_information as (
  select distinct points,
          ntile(10) over (order by points) as value_decile,
          count(*) as frequency_count,
          ntile(10) over (order by count(points)) as frequency_decile
    from mock2
group by points
)

-- Get every row in the base table and use the CTEs above for lookups (joins) with the extra data.
select mock2.num_inst,
       num_inst_information.value_decile as num_inst_value_decile,
       num_inst_information.frequency_decile as num_inst_frequency_decile,

       mock2.points,
       points_information.value_decile as points_value_decile,
       points_information.frequency_decile as points_frequency_decile

from mock2
left join num_inst_information on (num_inst_information.num_inst = mock2.num_inst)
left join points_information   on (points_information.points = mock2.points)
order by 1,2,3,4,5,6

The new version is a bit faster, which is also nice.

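S-Man asked for some sample data and output. Fair enough! And thanks for reading and, hopefully, helping. I've set up a Pastebin account with the 1-column example: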

https://pastebin.com/embed_js/eUZkBqhA

and the 2-column data: https://pastebin.com/embed_js/7J4vx850

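But, honestly, they're just random numbers. The point of my question is to find a concise and efficient way to add derived aggregate data to every row: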

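Get the original values from every record in the output. 1M rows in the table, 1M rows in the result.
For each row, add three new columns to the output for each "real" data column:

1) The percentile of the value. 2) The frequency (count) of the value. 3) The percentile of the frequency.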

For example,

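num_inst The real data from the base table.

num_inst_value_percentile The percentile of this num_inst among all num_inst values in the table (deciles are used above).

num_inst_frequency How often does this value occur in the whole table? So, a count.

num_inst_frequency_percentile The percentile of this num_inst_frequency among all num_inst_frequency values in the table (deciles are used above).

...and then the same for the points field, and for many other fields in multiple tables.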
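
To illustrate just the frequency columns described above, here is a minimal sketch that uses a window aggregate on mock2 instead of a GROUP BY plus a join. It is not a full answer: it covers only the counts, not the value or frequency deciles (which the queries above compute over distinct values), and how it performs on millions of rows is untested. The output column names follow the naming used in the example above.

```sql
-- Sketch only: per-row frequency of each column's value, with no GROUP BY
-- and no join back to the base table. One output row per input row.
select num_inst,
       count(*) over (partition by num_inst) as num_inst_frequency,
       points,
       count(*) over (partition by points) as points_frequency
  from mock2;
```

Each additional column would need its own count(*) over (partition by ...) expression, and each distinct PARTITION BY typically means a separate sort or hash step, so this trades query length for work done per column.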

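In case anyone is wondering, our data is long-tailed and hard to chart. Binning the data into percentiles makes it easier to figure out the shape of the data/distribution. Once this is working, the next step is to figure out how to use window functions (I think) to get the size of the range for each percentile.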
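
One possible shape for that range-size step, sketched on the one-column toy table mock, is shown below. Note that it uses plain aggregation over the decile rather than a window function, and the CTE and output column names are invented for this illustration.

```sql
-- Sketch only: lowest value, highest value, and span of each value decile.
with value_deciles as (
  select n as value,
         ntile(10) over (order by n) as value_decile
    from mock
group by n
)
select value_decile,
       min(value) as lowest_value,
       max(value) as highest_value,
       max(value) - min(value) as range_size
  from value_deciles
group by value_decile
order by value_decile;
```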

Hope this is clearer!

Calculate percentile of value, frequency, and percentile of frequency in Postgres 11
