I want to select 20 random rows from a large table, and the following query works fine:
SELECT id
FROM timeseriesentry
WHERE random() < 20*1.0/12940622
(12940622 is the number of rows in the table.) I now want to retrieve the row count automatically, but using
WITH tmp AS (SELECT COUNT(*) n FROM timeseriesentry)
SELECT id
FROM timeseriesentry, tmp
WHERE random() < 20*1.0/n
yields zero rows, even though n is correct.
What am I missing here?
Edit: id is not numeric, which is why I cannot generate a random series to select from. I need the proposed structure because my actual goal is
WITH npt AS (
SELECT type, COUNT(*) n
FROM timeseriesentry
GROUP BY type
)
SELECT v.id
FROM timeseriesentry v
JOIN npt ON npt.type= v.type
WHERE random() < 200*1.0/npt.n
so that the sample size is roughly the same for each type.
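Answer 0: (score: 1)
This is ugly, but it works. It also avoids using the identifier type, which is a (non-reserved) keyword; the column is called ztype here instead.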
WITH zzz AS (
SELECT ztype
, COUNT(*) AS cnt
FROM timeseriesentry
GROUP BY ztype)
SELECT *
FROM timeseriesentry src
WHERE random() < 20.0 / (SELECT cnt FROM zzz
WHERE zzz.ztype = src.ztype)
ORDER BY src.ztype
;
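A possible explanation of why the join in the question returns nothing while the correlated subquery above works: in the question's query the condition random() < 20*1.0/n refers only to columns of tmp, so the planner is free to apply it as a filter on the single-row tmp scan. It is then evaluated once rather than once per row of timeseriesentry, and it passes with a probability of only about 20/n. The subquery above refers to src.ztype, so it has to be evaluated for every row of src. This is my reading, not something stated in the answer; one way to check it on your own installation is to look at where the Filter line ends up in the plan:

```sql
-- Sketch: inspect where the planner places the random() filter
-- for the query from the question (no expected output shown here,
-- since the plan depends on the PostgreSQL version and settings).
EXPLAIN
WITH tmp AS (SELECT COUNT(*) n FROM timeseriesentry)
SELECT id
FROM timeseriesentry, tmp
WHERE random() < 20*1.0/n;
```

If the Filter appears under the scan of tmp rather than under the scan of timeseriesentry, that would account for the empty result.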
Update: the same thing, with a window function in a subquery:
SELECT *
FROM (SELECT *
, sum(1) OVER (PARTITION BY ztype) AS cnt
FROM timeseriesentry
) src
WHERE random() < 20.0 / src.cnt
ORDER BY src.ztype
;
Or, more compactly, the same thing but using a CTE:
WITH src AS(SELECT *
, sum(1) OVER (PARTITION BY ztype) AS cnt
FROM timeseriesentry
)
SELECT *
FROM src
WHERE random() < 20.0 / src.cnt
ORDER BY src.ztype
;
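Note: the CTE versions do not necessarily perform the same. In fact, they tend to be slower. (Since the original query has to visit every row of the timeseriesentry table in either case, it will not make much difference in this particular case.)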
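The queries above return roughly 20 rows per ztype, since each row is kept independently with probability 20/cnt. If exactly 20 rows per type are needed (or all rows, for types with fewer than 20), a different technique is to number the rows of each type in random order and keep the first 20. The following is only a sketch of that alternative, assuming the same ztype column name as above; it has to sort each partition, so it may well be slower than the random() filter.

```sql
-- Sketch: at most 20 randomly chosen rows per ztype.
-- Assumes the column is named ztype, as in the queries above.
SELECT *
FROM (SELECT t.*
        -- rn numbers the rows of each ztype in random order
        , row_number() OVER (PARTITION BY ztype ORDER BY random()) AS rn
      FROM timeseriesentry t
     ) src
WHERE src.rn <= 20
ORDER BY src.ztype
;
```

The result carries the helper column rn; list the wanted columns explicitly in the outer SELECT to drop it.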
Answer 1: (score: 1)
I created a table without a numeric field:
create table timeseriesentry as select generate_series('2015-01-01'::timestamptz,'2015-01-02'::timestamptz,'1 second'::interval) id, 'ret'::text v
;
and used window aggregates again:
WITH tmp AS (SELECT round(count(*) over()*random()) n FROM timeseriesentry limit 20)
select id from
(SELECT row_number() over() rn,id
FROM timeseriesentry
) sel, tmp
WHERE rn =n
;
So it gives 20 "random" rows:
2015-01-01 01:27:22+01
2015-01-01 03:33:51+01
2015-01-01 06:15:28+01
2015-01-01 09:52:21+01
2015-01-01 10:00:02+01
2015-01-01 10:08:33+01
2015-01-01 10:26:31+01
2015-01-01 12:55:21+01
2015-01-01 14:03:54+01
2015-01-01 14:05:36+01
2015-01-01 15:12:08+01
2015-01-01 15:45:55+01
2015-01-01 16:10:35+01
2015-01-01 17:11:02+01
2015-01-01 18:18:32+01
2015-01-01 19:35:51+01
2015-01-01 22:06:08+01
2015-01-01 22:12:42+01
2015-01-01 22:43:45+01
2015-01-01 22:49:55+01
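One caveat with the query above: the 20 values of n are drawn independently, so the same n can come up twice (the same row is then returned twice), and round() can produce 0, which matches no row_number. Both are unlikely on a large table, but the result is not guaranteed to be 20 distinct rows. If that matters, a commonly used alternative is to sort by random() and take the first 20 rows. This is a sketch of a different technique rather than a fix to the query above, and it still reads the whole table:

```sql
-- Sketch: 20 distinct rows in random order.
-- Works for a non-numeric id, because id is never compared with a number.
SELECT id
FROM timeseriesentry
ORDER BY random()
LIMIT 20
;
```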
Answer 2: (score: 0)
I guess the closest I got was:
WITH tmp AS (SELECT round(count(*) over()*random()) n FROM timeseriesentry limit 20)
SELECT id
FROM timeseriesentry, tmp
WHERE id=n