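How to drop groups that do not have enough observations?

Date: 2016-01-16 18:08:30

Tags: r data.table

How can I drop groups that do not have enough observations? In the following reproducible example, each person (identified by name) starts with 10 observations: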

install.packages('randomNames') # install package if required
install.packages('data.table')  # install package if required
lapply(c('data.table', 'randomNames'), require, character.only = TRUE) # load packages

set.seed(1)
testDT <- data.table( date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"),10),
                      name = rep(randomNames(10, which.names='first'), times=1, each=10),
                      Y    =  runif(100, 5, 15),
                      X    =  rnorm(100, 2, 9))
testDT <- testDT[ X > 0] # keep only rows with positive X, so group sizes become unequal

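Now I want to keep only the people with at least 6 observations, so Gracelline, Anna, Aesha and Michael must be dropped, because they have only 3, 2, 4 and 5 observations, respectively.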

  testDT[, length(X), by=name]
            name V1
   1:      Blake  6
   2:  Alexander  6
   3:     Leigha  8
   4: Gracelline  3
   5:   Epifanio  7
   6:     Keasha  6
   7:      Robyn  6
   8:       Anna  2
   9:      Aesha  4
  10:    Michael  5

How can I do this automatically (the real dataset is much larger)?

Edit

Yes, it is a duplicate. :( The last proposed method was the fastest:

> system.time(testDT[, .SD[.N>=6], by = name])
   user  system elapsed 
  0.293   0.227   0.517 
> system.time(testDT[testDT[, .I[.N>=6], by = name]$V1])
   user  system elapsed 
  0.163   0.243   0.415 
> system.time(testDT[,if(.N>=6) .SD , by = name])
   user  system elapsed 
  0.073   0.323   0.399 
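
The timings above come from the small example table, where the differences are close to measurement noise. A minimal sketch for repeating the comparison on a larger table follows. bigDT, its size and its group labels are made up for illustration and are not the real dataset, and no particular ranking is claimed because results depend on the machine and the data.

```r
library(data.table)

set.seed(1)
# Hypothetical larger table: 20,000 rows spread unevenly over up to 2,000 groups
bigDT <- data.table(name = sample(sprintf("id%04d", 1:2000), 20000, replace = TRUE),
                    X    = rnorm(20000))

# Time each of the three approaches and keep its result
t_sd <- system.time(r_sd <- bigDT[, .SD[.N >= 6], by = name])
t_i  <- system.time(r_i  <- bigDT[bigDT[, .I[.N >= 6], by = name]$V1])
t_if <- system.time(r_if <- bigDT[, if (.N >= 6) .SD, by = name])

# All three should keep the same number of rows
c(nrow(r_sd), nrow(r_i), nrow(r_if))

# Elapsed seconds per approach
c(SD = t_sd[["elapsed"]], I = t_i[["elapsed"]], `if` = t_if[["elapsed"]])
```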

1 answer:

Answer 0: (score: 1)

We group by 'name', get the number of rows in each group (.N), and if it is at least 6, we return the Subset of the Data.table (.SD).

testDT[,if(.N>=6) .SD , by = name]
#       name       date         Y           X
# 1:     Blake 2010-01-01  9.820801  3.69913070
# 2:     Blake 2012-01-01  9.935413 15.18999375
# 3:     Blake 2013-01-01  6.862176  3.37928004
# 4:     Blake 2014-01-01 13.273733 21.55350503
# 5:     Blake 2015-01-01 11.684667  6.27958576
# 6:     Blake 2017-01-01  6.079436  7.49653718
# 7: Alexander 2010-01-01 13.209463  4.62301612
# 8: Alexander 2012-01-01 12.829328  2.00994816
# 9: Alexander 2013-01-01 10.530363  2.66907192
#10: Alexander 2016-01-01  5.233312  0.78339246
#11: Alexander 2017-01-01  9.772301 12.60278297
#12: Alexander 2019-01-01 11.927316  7.34551569
#13:    Leigha 2010-01-01  9.776196  4.99655334
#14:    Leigha 2011-01-01 13.612095 11.56789854
#15:    Leigha 2013-01-01  7.447973  5.33016929
#16:    Leigha 2014-01-01  5.706790  4.40388912
#17:    Leigha 2016-01-01  8.162717 12.87081025
#18:    Leigha 2017-01-01 10.186343 12.44362354
#19:    Leigha 2018-01-01 11.620051  8.30192285
#20:    Leigha 2019-01-01  9.068302 16.28150109
#21:  Epifanio 2010-01-01  8.390729 17.90558542
#22:  Epifanio 2011-01-01 13.394404  8.45036728
#23:  Epifanio 2012-01-01  8.466835 10.19156807
#24:  Epifanio 2013-01-01  8.337749  5.45766822
#25:  Epifanio 2014-01-01  9.763512 17.13958472
#26:  Epifanio 2017-01-01  8.899895 14.89054015
#27:  Epifanio 2019-01-01 14.606180  0.13357331
#28:    Keasha 2013-01-01  8.253522  6.44769498
#29:    Keasha 2014-01-01 12.570871  0.40402566
#30:    Keasha 2016-01-01 12.111212 14.08734943
#31:    Keasha 2017-01-01  6.216919  0.06878532
#32:    Keasha 2018-01-01  7.454885  0.38399123
#33:    Keasha 2019-01-01  6.433044  1.09828333
#34:     Robyn 2010-01-01  7.396294  8.41399676
#35:     Robyn 2011-01-01  5.589344  1.33792036
#36:     Robyn 2012-01-01 11.422883  1.66129246
#37:     Robyn 2015-01-01 12.973088  2.54144396
#38:     Robyn 2017-01-01  9.100841  6.78346573
#39:     Robyn 2019-01-01 11.049333  4.75902075

Or, instead of if, we can use the condition .N>=6 directly as the row index inside .SD:

testDT[, .SD[.N>=6], by = name]

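This may be somewhat slower, so another option is to use .I to get the row indices of the qualifying groups and then subset the original table with them: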

testDT[testDT[, .I[.N>=6], by = name]$V1]
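
To check that the three variants above select the same rows, here is a small sketch on a hand-made table. toyDT and the threshold minObs are hypothetical names introduced only for this illustration, and the threshold is lowered to 2 so the table stays tiny.

```r
library(data.table)

# Hand-made table: group "a" has 3 rows, "b" has 1 row, "c" has 2 rows
toyDT <- data.table(name = c("a", "a", "a", "b", "c", "c"),
                    X    = 1:6)
minObs <- 2L  # hypothetical threshold; the question uses 6

# 1) return .SD only for groups that are large enough
res_if <- toyDT[, if (.N >= minObs) .SD, by = name]

# 2) index .SD with a single TRUE/FALSE per group
res_sd <- toyDT[, .SD[.N >= minObs], by = name]

# 3) collect row numbers (.I) of large-enough groups, then subset once
res_i  <- toyDT[toyDT[, .I[.N >= minObs], by = name]$V1]

# Groups "a" and "c" are kept; "b" (1 row) is dropped in all three results
res_if
```

The .I variant subsets the original table once, so it keeps the original column order, while the two grouped variants put the by column first.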