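I have a dataset that I want to split into training and test sets at a 70:30 ratio using PostgreSQL. How can I do that? I tried the following code, but it does not seem to work properly: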
create table training_test as
(
WITH TEMP as
(
SELECT ROW_NUMBER() AS ROW_ID , Random() as RANDOM_VALUE,D.*
FROM analytics.model_data_discharge_v1 as D
ORDER BY RANDOM_VALUE
)
SELECT 'Training',T.* FROM TEMP T
WHERE ROW_ID <= 493896*0.70
UNION
SELECT 'Test',T.* FROM TEMP T
WHERE ROW_ID > 493896*0.70
) distributed by(hospitalaccountrecord);
Answer 0 (score: 2)
select t.*,
case
when random() < 0.7 then 'training'
else 'test'
end as split
from analytics.model_data_discharge_v1 t
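Note that the query above assigns each row independently, so the training share is only approximately 70%. If an exact 70:30 split is needed, a minimal sketch along the lines of the question's approach could look like the following. It assumes plain PostgreSQL; the names shuffled, row_id, total_rows and split are illustrative. Compared with the question's query, ROW_NUMBER() gets the OVER clause it requires, the label column gets an alias, and the row count is computed rather than hard-coded.

```sql
-- Sketch only: exact 70:30 split by ranking rows in random order.
with shuffled as (
    select
        d.*,
        row_number() over (order by random()) as row_id,  -- random rank 1..N
        count(*) over ()                      as total_rows  -- N, repeated on every row
    from analytics.model_data_discharge_v1 d
)
select
    case
        when row_id <= 0.70 * total_rows then 'training'
        else 'test'
    end as split,
    s.*
from shuffled s;
```

Because there is a single pass with a CASE expression, no UNION is needed, so rows are neither deduplicated nor scanned twice. The result still changes on every run, since it depends on random().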
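Answer 1 (score: 0)
If you want a stratified split, you can use the following code.
The first part (the ssize CTE) guarantees that every group has the minimum size needed to be split.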
with ssize as (
-- keep only the groups that are large enough to be split
select
"group"
from to_split_table
group by "group"
having count(*) >= {{ MINIMUM GROUP SIZE }}) -- {{ MINIMUM GROUP SIZE }} = 1 / {{ TEST_THRESHOLD }}
select
id_aux,
ts."group",
case
when
-- relative position of the row inside its group, in random order
cast(row_number() over (partition by ts."group" order by random()) as double precision) / cast(count(*) over (partition by ts."group") as double precision)
< {{ TEST_THRESHOLD }} then 'test'
else 'train'
end as splitting
from to_split_table ts
join ssize
on ts."group" = ssize."group"
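Answer 2 (score: 0)
Don't use a random split, because it is not repeatable! random() returns different results every time it is called.
Instead, you can split the dataset using a hash and a modulo, as Google Cloud recommends. Here result is the hash of a column taken modulo 10, which is what the 80% figure implies:
if result < 8, the row becomes part of the 80% training set;
if result >= 8, the row becomes part of the 20% test set.
The GCP ML course, which this approach is taken from, shows an example using BigQuery.
This way, you get exactly the same 80% of the data every time.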
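A PostgreSQL version of the hash-and-modulo idea might look like the sketch below. It is not the BigQuery example from the course. It assumes that hospitalaccountrecord (the distribution key in the question) identifies a record well enough to hash on, and it builds the bucket from md5(), decode() and get_byte(); the names hashed, digest and split are illustrative.

```sql
-- Sketch only: repeatable 80:20 split based on a hash of a key column.
with hashed as (
    select
        t.*,
        -- md5() returns 32 hex characters; decode() turns them into 16 raw bytes
        decode(md5(t.hospitalaccountrecord::text), 'hex') as digest
    from analytics.model_data_discharge_v1 t
)
select
    h.*,  -- includes the helper column digest; list columns explicitly to drop it
    case
        -- first three digest bytes as a non-negative integer, reduced to a bucket 0..9
        when (get_byte(digest, 0) * 65536
            + get_byte(digest, 1) * 256
            + get_byte(digest, 2)) % 10 < 8 then 'training'
        else 'test'
    end as split
from hashed h;
```

The same key always lands in the same bucket, so rerunning the query reproduces the split, and rows that share a key stay together. The training share is only close to 80%: how close depends on how many distinct keys there are and how evenly they hash.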