How to draw without replacement to fill a dataset

Time: 2015-09-26 00:08:53

Tags: random stata

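I am generating a dataset in which I first want to randomly draw one value from a discrete distribution for each observation and fill var1 with those values. Next, I want to draw another value from the distribution for each row, but the catch is that the value already in var1 for that observation is no longer eligible to be drawn. I then want to repeat this process.

To make this more concrete, suppose I start with: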

id
1
2
3
...
999
1000

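Suppose the distribution I have is ["A", "B", "C", "D", "E"] with probabilities [.2, .3, .1, .15, .25].

I first want to draw randomly from this distribution to fill var1. Suppose the result is: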

id    var1
1     E
2     E
3     C
...   
999   B
1000  A

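Now E is no longer eligible to be drawn for observations 1 and 2. C, B, and A are ineligible for observations 3, 999, and 1000, respectively.

After all the columns have been filled, we might end up with something like this: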

id    var1  var2  var3  var4  var5
1     E     C     B     A     D
2     E     A     B     D     C
3     C     B     A     E     D
...        
999   B     D     C     A     E
1000  A     E     B     C     D
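
The property the finished dataset needs, that no value appears twice within a row, can be checked mechanically. Here is a minimal sketch that uses only the five example rows shown above; the variable name dup is introduced here purely for illustration.

```stata
clear
input int id str1 var1 str1 var2 str1 var3 str1 var4 str1 var5
1    E C B A D
2    E A B D C
3    C B A E D
999  B D C A E
1000 A E B C D
end

* dup is a hypothetical helper flag: 1 if any two columns in a row share a value
gen byte dup = 0
forvalues i = 1/4 {
    local next = `i' + 1
    forvalues j = `next'/5 {
        replace dup = 1 if var`i' == var`j'
    }
}

* stops with an error if any row contains a repeated value
assert dup == 0
```

The same pairwise comparison should work on the full 1000-row result, since it only depends on the five column names.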

I am not sure how to approach this in Stata. One way to fill in var1, though, would be something like the following:

gen random1 = runiform()
gen str1 var1 = ""
replace var1 = "A" if random1<.2
replace var1 = "B" if random1>=.2 & random1<.5
etc....
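
For reference, here is one way the pattern above might be written out in full. It is only a sketch: the cutoffs are the cumulative sums of the probabilities given earlier (.2, .5, .6, .75, 1), and the seed and the 1000 observations are assumptions made so that the block runs on its own.

```stata
clear
set seed 12345          // arbitrary seed, assumed here for reproducibility
set obs 1000            // assumed number of observations, matching the example
gen id = _n

gen double random1 = runiform()

* each value gets the interval between consecutive cumulative probabilities
gen str1 var1 = ""
replace var1 = "A" if random1 <  .2
replace var1 = "B" if random1 >= .2  & random1 < .5
replace var1 = "C" if random1 >= .5  & random1 < .6
replace var1 = "D" if random1 >= .6  & random1 < .75
replace var1 = "E" if random1 >= .75

* the observed shares should be close to .2, .3, .1, .15, .25
tabulate var1
```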

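Note that sticking to the (rescaled) probabilities after var1 has been created is desirable, but not essential for me.

1 Answer:

Answer 0 (score: 2)

Here is a solution that works in long form and picks values from the distribution. Once a value has been picked, it is marked as done, and the next pick is made from the group containing the remaining values. The probabilities are rescaled on each pass.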
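
To make "rescaled" concrete, here is a small sketch of what the probabilities would look like for one observation after its first draw. It assumes, purely for illustration, that var1 came out as E; the names ptot and pscaled are introduced here and are not part of the question.

```stata
clear
input str1 y double p
"A"  .2
"B"  .3
"C"  .1
"D" .15
"E" .25
end

* assumption for this sketch: var1 == "E", so E is no longer eligible
drop if y == "E"

* divide each remaining probability by the remaining total (.75 here)
egen double ptot = total(p)
gen double pscaled = p / ptot

* the rescaled probabilities sum to 1 again
list y p pscaled
```

With E removed, B goes from .3 to .3/.75 = .4, and the other values are rescaled in the same way.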

version 14
set seed 3241234

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ip str1 y double p
1 "A"  .2
2 "B"  .3
3 "C"  .1
4 "D" .15
5 "E" .25
end

local nval = _N

* the following should be true
isid y

expand 1000
bysort y: gen id = _n
sort id ip

gen done = 0

forvalues i = 1/`nval' {

    // scale probabilities
    bysort id done (ip): gen double ptot = sum(p)   // this is a running sum
    by id done: gen double phigh = sum(p / ptot[_N])
    by id done: gen double plow = cond(_n == 1, 0, phigh[_n-1])

    // random number in the range of (0,1) for the group
    bysort id done (ip): gen double x = runiform()

    // pick from the not done group; choose first x to represent group
    by id done: gen pick = !done & inrange(x[1], plow, phigh)

    // put the picked obs at the end and create the new var
    bysort id (pick ip): gen v`i' = y[_N]

    // we are done for the obs that was picked
    bysort id (pick ip): replace done = 1 if _n == _N

    drop x pick ptot phigh plow
}

bysort id: keep if _n == 1
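
To see what a single pass of the loop does, here is a sketch for one id in which E is assumed to be marked done already. It replaces runiform() with a fixed, hypothetical draw (the local macro x, set to .5) so that the outcome is reproducible; everything else follows the bound calculation used in the loop above.

```stata
clear
input byte ip str1 y double p byte done
1 "A"  .2 0
2 "B"  .3 0
3 "C"  .1 0
4 "D" .15 0
5 "E" .25 1
end

* hypothetical fixed draw, standing in for runiform()
local x = .5

* cumulative bounds, rescaled within the done and not-done groups
bysort done (ip): gen double ptot = sum(p)
by done: gen double phigh = sum(p / ptot[_N])
by done: gen double plow = cond(_n == 1, 0, phigh[_n-1])

* only values that are not done can be picked
gen byte pick = !done & inrange(`x', plow, phigh)

* B is picked: its rescaled interval is roughly [.267, .667], which contains .5
list y p done plow phigh pick, sepby(done)
```

Because the bounds are recomputed within the not-done group on every pass, the remaining values always share the full (0,1) range between them.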