Question

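I have a cross-national dataset in which every respondent has at least one diary. The number of diaries per respondent and the days on which the diaries were completed vary by country.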

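For example, in one country each respondent completed only one diary (half of the respondents completed it on a weekend day, the other half on a weekday). In another country each respondent completed 2 diaries (one weekend, one weekday), and in yet another everyone completed 7 diaries (one for each day of the week). There are also surveys in which some respondents returned 2 diaries while others returned 3, and one in which everyone returned 4 diaries. The data look like this: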

country_id<-rep(1:4,c(8,8,14,10))
diarist_id<-c(11:18,rep(21:24,each=2),
              rep(31:32,each=7),
              rep(41:44,c(3,3,2,2)))
diary_id<-c(111:118,211,212,221,222,231,232,241,242,
            311:317,321:327,411,412,413,
            421,422,423,431,432,441,442)
weekend<-c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,
           0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,
           0,1,0,1,0,1,0,1,0)

dat<-data.frame(country_id,diarist_id,diary_id,weekend)

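I am trying to draw a random "one diary per person" sample from each country. At the country level, however, I need roughly 29% of the sampled diaries to be weekend diaries. How can I draw such a conditional random sample by group?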

Answer 1

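I think this gets you what you're after. For clarity I chose to split the sample; there may be a way to get what you want without doing so, but it isn't coming to me.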

I'll use data.table:

set.seed(100)
library(data.table)
setDT(dat) #turn dat into a data.table (by reference)
country_n<-5 #how many observations you'd like per country

#split the data by weekend status
weekend.dat<-dat[weekend==1] #weekend is coded 0/1
#we have to take care that there are actually enough
#  weekend observations in each country, so we take the
#  minimum of 29% of country_n (rounded) and the total
#  number of weekend observations in that country
weekend.sample<-
  weekend.dat[weekend.dat[,.I[sample(.N,min(round(.29*country_n),.N))],
                          by=country_id]$V1]

#repeat for the weekday sample, except take 71% this time
weekday.dat<-dat[weekend==0]
weekday.sample<-
  weekday.dat[weekday.dat[,.I[sample(.N,min(round(.71*country_n),.N))],
                          by=country_id]$V1]

#combine; setkey orders the data (as well as other
#  things that may be useful later on)
full.sample<-setkey(rbindlist(list(weekend.sample,weekday.sample)),
                    country_id,diarist_id,diary_id)

Here is the sample generated for my given random seed:

> full.sample
    country_id diarist_id diary_id weekend
 1:          1         12      112       0
 2:          1         13      113       1
 3:          1         14      114       0
 4:          1         16      116       0
 5:          1         18      118       0
 6:          2         21      212       0
 7:          2         22      221       1
 8:          2         22      222       0
 9:          2         23      232       0
10:          2         24      242       0
11:          3         31      315       0
12:          3         31      316       0
13:          3         31      317       0
14:          3         32      321       1
15:          3         32      324       0
16:          4         41      411       1
17:          4         42      421       0
18:          4         42      423       0
19:          4         43      432       0
20:          4         44      442       0
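Note that the sample above can include several diaries from the same diarist (diarist 22 appears twice and diarist 31 three times), so it does not strictly satisfy the "one diary per person" requirement. One possible way to enforce it is to sample diarists rather than diaries: first choose which diarists contribute a weekend diary, then fill the remaining slots with weekday diaries from other diarists. The following is only a rough base-R sketch under that assumption; `pick` and `sample_country` are hypothetical helper names, not part of any package, and the greedy order of the two draws is a simplification.

```r
# Rebuild the example data from the question so the sketch runs on its own.
country_id <- rep(1:4, c(8, 8, 14, 10))
diarist_id <- c(11:18, rep(21:24, each = 2), rep(31:32, each = 7),
                rep(41:44, c(3, 3, 2, 2)))
diary_id <- c(111:118, 211, 212, 221, 222, 231, 232, 241, 242,
              311:317, 321:327, 411, 412, 413,
              421, 422, 423, 431, 432, 441, 442)
weekend <- c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,
             0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,
             0,1,0,1,0,1,0,1,0)
dat <- data.frame(country_id, diarist_id, diary_id, weekend)

# Hypothetical helper: draw k elements of x by index, which avoids the
# sample() pitfall of treating a length-one numeric x as 1:x.
pick <- function(x, k) x[sample.int(length(x), k)]

# Hypothetical helper: sample at most n diarists from one country, one diary
# each, aiming for a share p of weekend diaries.
sample_country <- function(d, n, p = 0.29) {
  ids    <- unique(d$diarist_id)
  n      <- min(n, length(ids))          # cannot exceed the number of diarists
  has_we <- unique(d$diarist_id[d$weekend == 1])
  has_wd <- unique(d$diarist_id[d$weekend == 0])

  # First choose the diarists who contribute a weekend diary...
  n_we   <- min(round(p * n), length(has_we))
  we_ids <- pick(has_we, n_we)
  # ...then fill the remaining slots with weekday diaries of other diarists.
  wd_pool <- setdiff(has_wd, we_ids)
  wd_ids  <- pick(wd_pool, min(n - n_we, length(wd_pool)))

  # Draw exactly one diary of the required type for a chosen diarist.
  one_diary <- function(id, we) {
    rows <- which(d$diarist_id == id & d$weekend == we)
    d[pick(rows, 1), ]
  }
  rbind(do.call(rbind, lapply(we_ids, one_diary, we = 1)),
        do.call(rbind, lapply(wd_ids, one_diary, we = 0)))
}

set.seed(100)
one_per_person <- do.call(
  rbind, lapply(split(dat, dat$country_id), sample_country, n = 5))
rownames(one_per_person) <- NULL
```

With `n = 5` this yields at most five diarists per country (fewer where a country has fewer diarists) and no diarist appears more than once. Because the weekend count is rounded, the realized weekend share in small countries can be far from 29% (for instance, 1 of 2 diaries in country 3), so a larger per-country sample is needed for the share to be close to the target.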
