How to efficiently select a random sample of variables from a set of variables in a data frame

Time: 2019-02-19 14:46:34

Tags: r function random data.table tidyr

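I would appreciate help with randomly selecting a subset of var.w_X variables: 5 of the 10 var.w_X variables contained in the sample data sampleDT, while keeping all the other variables that do not start with var.w_.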

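Below is the sample data sampleDT. It contains, among other variables (which should all be kept), X variables whose names start with var.w_ (from which the random sample of variables is to be drawn).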

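In the current example, X = 10, so the var.w_ variables run from var.w_1 to var.w_10, and I want to draw a random sample of 5 from these 10. In my actual data, however, X > 1,000,000, and I might want to draw a sample of 7,500 var.w_ variables from these X > 1,000,000 variables.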

Efficiency is therefore essential in any proposed solution, since I recently ran into some performance issues with mutate_at that I still cannot explain.

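Importantly, the other variables to keep (those that do not start with var.w_) are not guaranteed to be in any predetermined order: they may be located before, between, and/or after the var.w_ variables. Solutions that rely on column order will therefore not work.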

#sample data

sampleDT<-structure(list(n = c(62L, 96L, 17L, 41L, 212L, 143L, 143L, 143L, 
73L, 73L), r = c(3L, 1L, 0L, 2L, 170L, 21L, 0L, 33L, 62L, 17L
), p = c(0.0483870967741935, 0.0104166666666667, 0, 0.0487804878048781, 
0.80188679245283, 0.146853146853147, 0, 0.230769230769231, 0.849315068493151, 
0.232876712328767), var.w_8 = c(1.94254385942857, 1.18801169942857, 
3.16131123942857, 3.16131123942857, 1.13482609242857, 1.13042157942857, 
2.13042157942857, 1.13042157942857, 1.12335579942857, 1.12335579942857
), var.w_9 = c(1.942365288, 1.187833128, 3.161132668, 3.161132668, 
1.134647521, 1.130243008, 2.130243008, 1.130243008, 1.123177228, 
1.123177228), var.w_10 = c(1.94222639911111, 1.18769423911111, 
3.16099377911111, 3.16099377911111, 1.13450863211111, 1.13010411911111, 
2.13010411911111, 1.13010411911111, 1.12303833911111, 1.12303833911111
), group = c(1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 
0L, 0L), treat = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L), c1 = c(1.941115288, 
1.186583128, 1.159882668, 1.159882668, 1.133397521, 1.128993008, 
1.128993008, 1.128993008, 1.121927228, 1.121927228), var.w_6 = c(1.939115288, 1.184583128, 
3.157882668, 3.157882668, 1.131397521, 1.126993008, 2.126993008, 
1.126993008, 1.119927228, 1.119927228), var.w_7 = c(1.94278195466667, 
1.18824979466667, 3.16154933466667, 3.16154933466667, 1.13506418766667, 
1.13065967466667, 2.13065967466667, 1.13065967466667, 1.12359389466667, 
1.12359389466667), c2 = c(0.1438, 
0.237, 0.2774, 0.2774, 0.2093, 0.1206, 0.1707, 0.0699, 0.1351, 
0.1206), var.w_1 = c(1.941115288, 1.186583128, 3.159882668, 3.159882668, 
1.133397521, 1.128993008, 2.128993008, 1.128993008, 1.121927228, 
1.121927228), var.w_2 = c(1.931115288, 1.176583128, 3.149882668, 
3.149882668, 1.123397521, 1.118993008, 2.118993008, 1.118993008, 
1.111927228, 1.111927228), var.w_3 = c(1.946115288, 1.191583128, 
3.164882668, 3.164882668, 1.138397521, 1.133993008, 2.133993008, 
1.133993008, 1.126927228, 1.126927228), var.w_4 = c(1.93778195466667, 
1.18324979466667, 3.15654933466667, 3.15654933466667, 1.13006418766667, 
1.12565967466667, 2.12565967466667, 1.12565967466667, 1.11859389466667, 
1.11859389466667), var.w_5 = c(1.943615288, 1.189083128, 3.162382668, 
3.162382668, 1.135897521, 1.131493008, 2.131493008, 1.131493008, 
1.124427228, 1.124427228)), class = "data.frame", row.names = c(NA, -10L))

#my attempt

# based on the comment by @akrun - this does not keep the other variables as specified above

myvars <- sample(grep("var\\.w_", names(sampleDT), value = TRUE), 5)
sampleDT_test <- sampleDT[myvars]
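
To make the shortcoming concrete, here is a minimal sketch on a toy data frame (toy and its column names are made up for illustration and are not part of my real data). Indexing by the sampled names alone returns only those columns, so everything that does not start with var.w_ is lost:

```r
# Toy data: two "other" columns interleaved with four var.w_ columns
# (hypothetical names, for illustration only)
toy <- data.frame(n = 1:3, var.w_2 = 4:6, group = c(1L, 0L, 1L),
                  var.w_1 = 7:9, var.w_3 = 10:12, var.w_4 = 13:15)

set.seed(1)  # only to make the draw reproducible
w_names <- grep("^var\\.w_", names(toy), value = TRUE)
picked  <- sample(w_names, 2)

# Same pattern as the attempt above: index by the sampled names only
toy_test <- toy[picked]

# Columns that were dropped, including the ones that should have been kept
dropped <- setdiff(names(toy), names(toy_test))
print(dropped)
```

The dropped set contains n and group in addition to the unsampled var.w_ columns, which is exactly what I need to avoid.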

Thanks in advance for any help.

1 Answer:

Answer 0: (score: 1)

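Sorry, I had to step into a short meeting. I think you can take akrun's solution and additionally keep the columns of your sample data frame that do not match the pattern. Let me know how it scales on the whole data frame. Also, thanks for the further clarification.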

> # Subsetting the variable names not matching your pattern using grepl
> names(sampleDT)[!grepl("var\\.w_", names(sampleDT))]
[1] "n"     "r"     "p"     "group" "treat" "c1"    "c2"   
> 
> # Combine that with akrun's solution 
> myvars <- c(names(sampleDT)[!grepl("var\\.w_", names(sampleDT))],
+             sample(grep("var\\.w_", names(sampleDT), value = TRUE), 5))
> head(sampleDT[myvars])
    n   r          p group treat       c1     c2  var.w_6  var.w_1  var.w_4  var.w_3  var.w_8
1  62   3 0.04838710     1     0 1.941115 0.1438 1.939115 1.941115 1.937782 1.946115 1.942544
2  96   1 0.01041667     1     0 1.186583 0.2370 1.184583 1.186583 1.183250 1.191583 1.188012
3  17   0 0.00000000     0     0 1.159883 0.2774 3.157883 3.159883 3.156549 3.164883 3.161311
4  41   2 0.04878049     1     0 1.159883 0.2774 3.157883 3.159883 3.156549 3.164883 3.161311
5 212 170 0.80188679     0     0 1.133398 0.2093 1.131398 1.133398 1.130064 1.138398 1.134826
6 143  21 0.14685315     1     1 1.128993 0.1206 1.126993 1.128993 1.125660 1.133993 1.130422
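
If the original column positions should also be preserved (the output above moves the sampled var.w_ columns to the end), one option is to build a logical mask over the column names instead of concatenating two name vectors. What follows is only a sketch under assumptions: wide is a synthetic data frame made up here for illustration, startsWith() is swapped in as a fixed-prefix alternative to grepl(), and I have not measured how this behaves at 1,000,000+ columns.

```r
# Synthetic wide data: 3 "other" columns scattered among 1,000 var.w_ columns
# (names and sizes are assumptions for illustration)
set.seed(42)
n_w  <- 1000
wide <- as.data.frame(matrix(rnorm(5 * n_w), nrow = 5))
names(wide) <- paste0("var.w_", seq_len(n_w))
wide <- cbind(wide[1:400], n = 1:5, wide[401:900], group = 0L,
              wide[901:n_w], c1 = 0.5)

nm   <- names(wide)
is_w <- startsWith(nm, "var.w_")   # fixed-prefix test, no regex involved
pick <- sample(nm[is_w], 50)       # random sample of the var.w_ names

# Keep every non-var.w_ column plus the sampled ones, in their original order
keep   <- !is_w | nm %in% pick
result <- wide[keep]

dim(result)
```

Because keep is a logical vector aligned with names(wide), the result does not depend on where the other columns sit, and the retained columns stay in their original relative order.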