As the title suggests, I want to take some test data and shape it into a convincing demographic sample.
In a table of centile rankings, I have an arbitrary distribution pattern: 21 rows describing percentiles set at intervals of 5, which define the frequency distribution of the rankings.
Framed this way, the demographic analysis can be modeled to taste, but the technique could be applied to any kind of simulation, with any number of stratified frequency distributions. If the resulting data is too coarse-grained and more atomicity is needed, random values can then be created to fit within each percentile range.
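For illustration, here is a minimal sketch of that refinement step in T-SQL, assuming the assigned bucket has already been written back onto the student rows in a centile column (that column, and the name exact_centile, are my assumptions and not part of the schema below):

-- Hypothetical refinement: spread each 5-point bucket into an exact value.
-- Assumes s.centile holds the upper edge of the record's assigned bucket.
SELECT s.id,
       s.centile - 5 + CEILING(RAND(CHECKSUM(NEWID())) * 5) AS exact_centile
FROM students s;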
centile  frequency
-------  ---------
0        0
5        1
10       2
/~~~~~~~~~~~/
40       7
45       8
50       8
55       9
60       8
/~~~~~~~~~~~/
90       3
95       2
100      1
As the simplest case, I want to fold this distribution into a pre-existing set of test data ("student records"), randomly assigning each percentile grouping (80th, 85th, 90th, ...) to the corresponding number of student records (5 students, 4 students, 3 students, ...).
id    lname      fname      dob         centile
----  ---------  ---------  ----------  -------
1     Bender     Brooke     2016-10-07  5
2     Chan       Raya       2016-07-27  10
3     Acosta     Jared      2017-02-15  10
/~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/
98    Maddox     Cassady    2016-09-01  95
99    Mcdonald   Heather    2018-02-20  95
100   Todd       Sydnee     2017-03-12  100
In practice, I expect to target an arbitrary number of student records and distribute the percentiles proportionally. Each set of 21 frequencies would describe a different pattern, including highly skewed distributions (exponential, Weibull, Laplace), each tailored to a predefined, user-supplied shape.
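As a sketch of how one such skewed pattern might be generated rather than typed by hand, something like the following could populate the centiles table defined in the DDL below (the decay rate @rate is an assumed tuning knob, and the shape is only exponential-ish after rounding):

-- Hypothetical generator for an exponentially decaying 21-row pattern.
DECLARE @rate float = 0.03;
INSERT INTO centiles (centile, frequency)
SELECT v.centile,
       CAST(ROUND(100 * EXP(-@rate * v.centile), 0) AS int)  -- decaying frequency
FROM (VALUES (0),(5),(10),(15),(20),(25),(30),(35),(40),(45),(50),
             (55),(60),(65),(70),(75),(80),(85),(90),(95),(100)) AS v(centile);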
DDL:
CREATE TABLE centiles
(
frequency int
, centile int
) ;
CREATE TABLE students (
id int IDENTITY(1,1)
,lastname varchar(255)
,firstname varchar(255)
,dob date );
INSERT INTO centiles
(centile, frequency)
VALUES
( 0 , 0 )
,( 5 , 1 )
,( 10 , 2 )
,( 15 , 3 )
,( 20 , 3 )
,( 25 , 4 )
,( 30 , 5 )
,( 35 , 6 )
,( 40 , 7 )
,( 45 , 8 )
,( 50 , 8 )
,( 55 , 9 )
,( 60 , 8 )
,( 65 , 8 )
,( 70 , 7 )
,( 75 , 6 )
,( 80 , 5 )
,( 85 , 4 )
,( 90 , 3 )
,( 95 , 2 )
,( 100 , 1 );
INSERT INTO Students
( [LastName],[FirstName],[dob])
VALUES
( 'Bender','Brooke','2016-10-07')
, ( 'Chan','Raya','2016-07-27')
, ( 'Acosta','Jared','2017-02-15')
, ( 'Chase','Leah','2017-09-05')
, ( 'Jefferson','Giselle','2016-09-15')
, ( 'Paul','Sage','2017-04-02')
, ( 'Mckinney','Shaine','2018-02-15')
, ( 'Key','Bertha','2016-05-12')
, ( 'Donovan','Morgan','2016-10-25')
, ( 'Graves','Gil','2016-07-14')
, ( 'Chan','Hilel','2016-08-02')
, ( 'Davenport','Mollie','2017-04-08')
, ( 'Mccoy','Ayanna','2016-07-18')
, ( 'Head','Camden','2016-06-25')
, ( 'Hickman','Risa','2016-05-23')
, ( 'Salazar','Ivy','2017-05-22')
, ( 'Hyde','Kane','2017-06-12')
, ( 'Allen','Carol','2018-01-09')
, ( 'Quinn','Phillip','2016-12-21')
, ( 'Pollard','Aristotle','2017-06-16')
, ( 'Hinton','Colorado','2017-02-09')
, ( 'Howard','Nehru','2018-02-03')
, ( 'Chambers','Hillary','2016-09-08')
, ( 'Padilla','Warren','2017-05-29')
, ( 'Rutledge','Plato','2016-07-31')
, ( 'Goodman','Serina','2017-12-07')
, ( 'Bean','Stewart','2017-04-10')
, ( 'Tran','Sacha','2016-10-15')
, ( 'Schroeder','Kai','2017-10-04')
, ( 'Cooper','Phyllis','2016-11-27')
, ( 'Pierce','Madeline','2018-02-16')
, ( 'Lee','Kibo','2018-03-22')
, ( 'Robles','Libby','2016-09-03')
, ( 'Riley','Veronica','2018-03-03')
, ( 'Booth','Wynter','2018-04-09')
, ( 'Bird','Eugenia','2017-04-06')
, ( 'Morton','Ryder','2016-10-14')
, ( 'Tanner','Paloma','2017-08-25')
, ( 'Powers','Colton','2018-03-05')
, ( 'Mccarthy','Roth','2017-04-17')
, ( 'Floyd','Neve','2017-08-15')
, ( 'Mcneil','Ria','2017-11-18')
, ( 'Hoffman','Odessa','2018-03-26')
, ( 'Christian','Vanna','2016-05-16')
, ( 'Mercer','Madison','2017-01-31')
, ( 'Franks','Angela','2016-07-31')
, ( 'Obrien','Desirae','2016-08-03')
, ( 'Walls','Elmo','2017-02-25')
, ( 'Flores','Hakeem','2016-09-12')
, ( 'Waller','Demetrius','2018-02-28')
, ( 'Savage','Mara','2018-02-02')
, ( 'Wilkerson','Germane','2018-01-23')
, ( 'Ramirez','Aphrodite','2017-05-31')
, ( 'Fischer','Amery','2017-07-19')
, ( 'Sweeney','Upton','2017-01-18')
, ( 'Joyner','Simon','2017-11-18')
, ( 'Dunn','Logan','2017-04-14')
, ( 'Tyler','Shannon','2017-05-27')
, ( 'Dillard','Fritz','2016-12-28')
, ( 'Moran','Rooney','2017-12-08')
, ( 'Logan','Hunter','2016-11-06')
, ( 'Gamble','Talon','2017-04-08')
, ( 'Mckay','Quon','2017-08-22')
, ( 'Livingston','Wylie','2017-02-21')
, ( 'Hensley','Quincy','2018-01-08')
, ( 'Mcmahon','Meredith','2018-04-26')
, ( 'Flowers','Zachery','2018-01-29')
, ( 'Shepherd','Cairo','2017-01-25')
, ( 'Sweet','Sarah','2017-10-30')
, ( 'Newton','Calvin','2017-07-22')
, ( 'Cameron','Paloma','2016-09-07')
, ( 'Combs','Warren','2017-01-14')
, ( 'Ayala','Gary','2018-04-16')
, ( 'Beard','Shellie','2018-01-02')
, ( 'Witt','Anthony','2017-09-14')
, ( 'Garner','Quon','2016-06-12')
, ( 'Petersen','Maris','2017-11-20')
, ( 'Noble','Igor','2018-03-18')
, ( 'Adkins','Isaiah','2017-03-20')
, ( 'Mcclain','Gillian','2016-09-01')
, ( 'Henson','Bert','2016-06-30')
, ( 'Randall','Zeus','2018-02-26')
, ( 'Hart','Christine','2017-05-31')
, ( 'Carter','Jocelyn','2017-05-10')
, ( 'Mcfadden','Celeste','2018-03-11')
, ( 'Contreras','Abbot','2017-04-05')
, ( 'Kerr','Uriel','2016-05-06')
, ( 'Wood','Sybil','2016-12-14')
, ( 'Armstrong','Ethan','2017-09-20')
, ( 'Morse','Rae','2018-01-25')
, ( 'York','Irene','2018-04-30')
, ( 'Garrison','Thor','2016-06-20')
, ( 'Pace','Harlan','2017-02-02')
, ( 'Cleveland','Kylan','2016-06-18')
, ( 'Stanley','Roth','2016-10-28')
, ( 'Kemp','Alan','2016-11-04')
, ( 'Stewart','Frances','2017-12-13')
, ( 'Maddox','Cassady','2016-09-01')
, ( 'Mcdonald','Heather','2018-02-20')
, ( 'Todd','Sydnee','2017-03-12')
;
Answer (score: 0):
So: compute the cumulative distribution of the frequencies, and use it to derive a lower and upper bound on [0, 1) for each centile. The tricky part is generating a per-row random number in SQL Server.
But the rest is just a join:
with c as (
      -- cumulative frequency, converted to a [0, 1) probability band per centile
      select c.*,
             ((sum(frequency) over (order by centile) - frequency) / sum(frequency * 1.0) over ()) as lb,
             (sum(frequency) over (order by centile) / sum(frequency * 1.0) over ()) as ub
      from centiles c
     )
select s.*, c.centile
from (select s.*,
             rand(checksum(newid())) as randish  -- per-row uniform value in [0, 1)
      from students s
     ) s join
     c
     on s.randish >= c.lb and s.randish < c.ub;
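Not from the original answer, but a quick sanity check: if the output above is materialized into a table (here the hypothetical student_centiles, holding the id/centile pairs), the realized counts can be compared to the targets. Since the assignment is probabilistic, the counts match the target frequencies only in expectation, not exactly.

-- Sketch of a verification query; student_centiles is an assumed table.
SELECT c.centile,
       c.frequency AS target,
       COUNT(sc.id) AS assigned
FROM centiles c
LEFT JOIN student_centiles sc ON sc.centile = c.centile
GROUP BY c.centile, c.frequency
ORDER BY c.centile;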
I should note: there may well be other ways to generate the random number besides rand(checksum(newid())). This one works well in practice.
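For instance, one commonly seen alternative (my sketch, not part of the answer) scales ABS(CHECKSUM(NEWID())) straight into [0, 1) without calling RAND:

-- Alternative per-row uniform value in [0, 1); the small modulo bias is
-- irrelevant here, though ABS can error on the rare CHECKSUM of INT_MIN.
SELECT s.id,
       ABS(CHECKSUM(NEWID())) % 1000000 / 1000000.0 AS randish
FROM students s;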