Question

(Cross-posting this from the AWS forums...)

Reproducing this requires a fairly large chunk of dummy data. I used this English word list: http://www.mieliestronk.com/corncob_lowercase.txt

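In Amazon Redshift, I'm seeing wildly different result counts from seemingly equivalent queries that involve the random() function in a CTE. (I'm trying to take a random sample: one query returns an actual sample as expected, while the other returns essentially the entire list of items I'm trying to sample from.)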

Could someone take a look? Am I doing something wrong, or is something else going on?

/* Create tables to hold words */

create table main_words(word varchar(max));
create table couple_words(word varchar(max));

/* Get some words */

copy main_words
    from 'S3 LOCATION OF CORNCOB FILE'
    credentials 'aws_access_key_id=ID;aws_secret_access_key=KEY' 
    csv;

/* Put a few in another table */

insert into
    couple_words
select top 5000
    word
from
    main_words;

/* Returns about 500 results */

with the_cte as
    (
        select
            word,
            random() as random_value
        from
            main_words
        where
            word not in (select word from couple_words)
    )
select
    count(*)
from
    the_cte
where
    random_value > .99;


/* Returns about 58,000 results (basically, the whole list) */

with the_cte as
    (
        select
            word
        from
            main_words
        where
            word not in (select word from couple_words)
            and random() > .99
    )
select
    count(*)
from
    the_cte;

/* Clean up */

drop table if exists main_words;
drop table if exists couple_words;

Answer 1

Have you tried it on another server?

I just created a sample on SQL Fiddle with 100 rows and random() > 0.9, and the two queries give pretty similar results.

First CTE

| count |
|-------|
|     4 |

Second CTE

| count |
|-------|
|    13 |

Average count(*) over 10 runs

CTE 1     CTE 2
 8.3       9.8
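
For anyone who wants to repeat that comparison, here is a minimal sketch of what such a 100-row test might look like. It assumes a PostgreSQL engine with generate_series available; the generated word values and the size of couple_words (5 rows) are made up for illustration and are not the exact setup used for the numbers above.

```sql
/* Hypothetical setup: 100 generated words, 5 of them copied to couple_words */

create table main_words(word varchar(100));
create table couple_words(word varchar(100));

insert into main_words
select 'word' || n::text
from generate_series(1, 100) as s(n);

insert into couple_words
select word from main_words limit 5;

/* CTE 1: random() computed as a column, filtered outside the CTE */

with the_cte as
    (
        select
            word,
            random() as random_value
        from
            main_words
        where
            word not in (select word from couple_words)
    )
select
    count(*)
from
    the_cte
where
    random_value > .9;

/* CTE 2: random() used directly in the where clause inside the CTE */

with the_cte as
    (
        select
            word
        from
            main_words
        where
            word not in (select word from couple_words)
            and random() > .9
    )
select
    count(*)
from
    the_cte;
```

Because random() is re-evaluated on every run, the two counts only make sense as averages over several runs, not as a single comparison.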

Answer 2

I suspect some funky query rewriting. If you must have the inner query, you can try putting LIMIT 2147483647 inside it and see what comes out.
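
As a rough sketch of that suggestion, here is one possible reading of it, applied to the second query from the question. It assumes "the inner query" means the CTE body; the same LIMIT could instead be tried on the not in subquery. This has not been verified on Redshift, so treat it as an experiment rather than a fix.

```sql
/* Sketch only: second query from the question with a LIMIT added
   inside the CTE. 2147483647 is the largest 32-bit signed integer,
   so the limit should not drop any rows; the idea is only to see
   whether it changes how the planner rewrites the query. */

with the_cte as
    (
        select
            word
        from
            main_words
        where
            word not in (select word from couple_words)
            and random() > .99
        limit 2147483647
    )
select
    count(*)
from
    the_cte;
```

Comparing the EXPLAIN output of this version against the original should show whether the plan actually changed.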

Random() in a Redshift CTE returns very incorrect results under certain conditions
