Question

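I have the following problem: my table1 holds N positive samples, and it grows over time. I want to select 10N negative samples from another, very large table. So it would look something like this: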

WITH positive_samples AS (
  SELECT * FROM table1
), negative_samples AS (
  SELECT * FROM table2 LIMIT 100 
)

This query has two problems: it does not guarantee that negative_samples is 10 times as large as positive_samples, and it does not pick the negative samples at random.

What is the correct query to select these two sets in Hive or Presto?

Answer 1

An algorithm that gets the output you want in Hive looks like this:

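R1 = the negative data set in random order
R2 = R1 with a row number assigned to each row
CP = a table with one row and one column holding the number of POSITIVE rows; call the column positive_cnt
J = the Cartesian product of R2 and CP
FINAL = the rows of J where row_number <= (positive_cnt * 10)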

The actual query (tested on some data sets):

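with
-- CP: one-row table holding the number of positive rows
pcount as ( select count(*) as positive_cnt from POSITIVE)
,
-- R1: negative rows in random order
nrandom as ( select * from NEGATIVE order by rand())
,
-- R2: number the shuffled rows
nrandom_row_num as ( select *, row_number() over() as row_number from nrandom )
,
-- J: attach positive_cnt to every row via a Cartesian product
jnd as (select * from nrandom_row_num, pcount)
-- FINAL: keep the first 10 * positive_cnt rows
select * from jnd
where row_number <= (positive_cnt * 10);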
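
A possible variant, offered as a sketch rather than a tested replacement: the query above relies on row_number() over() seeing the rows in the order produced by the earlier order by rand(), which SQL engines are generally not required to preserve. Putting the random ordering inside the window makes the numbering itself random. POSITIVE and NEGATIVE are the same placeholder table names as above; the aliases n, r, p and the column name rn are made up for illustration. The syntax is intended to work in both Hive and Presto, but that is an assumption to check against your version.

```sql
-- Sketch only: POSITIVE / NEGATIVE are placeholder table names.
WITH pcount AS (
  -- One row, one column: the current number of positive samples.
  SELECT count(*) AS positive_cnt FROM POSITIVE
),
nranked AS (
  -- Number the negative rows in random order inside the window itself,
  -- so the result does not depend on a subquery's ORDER BY being kept.
  SELECT n.*, row_number() OVER (ORDER BY rand()) AS rn
  FROM NEGATIVE n
)
SELECT r.*
FROM nranked r
CROSS JOIN pcount p                 -- pcount has exactly one row
WHERE r.rn <= p.positive_cnt * 10;  -- keep 10N random negative rows
```

Because pcount is recomputed on every run, the number of negative rows follows table1 as it grows. If NEGATIVE has fewer than 10N rows, the query simply returns all of them. Note that ordering the whole negative table by rand() is a global sort, which can be slow on a very large table, and that some Hive configurations reject Cartesian products in strict mode.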
