如果没有足够的观察结果,如何删除组?
在以下可重复的示例中,每个人(由name
标识)有10个观察结果:
install.packages('randomNames') # install package if required
install.packages('data.table') # install package if required
lapply(c('data.table', 'randomNames'), require, character.only = TRUE) # load packages
set.seed(1)
testDT <- data.table( date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"),10),
name = rep(randomNames(10, which.names='first'), times=1, each=10),
Y = runif(100, 5, 15),
X = rnorm(100, 2, 9),
testDT <- testDT[ X > 0]
现在我想只保留至少6个观察者,所以必须删除Gracelline,Anna,Aesha和Michael,因为他们有 分别只有3个,2个,4个和5个观察结果。
testDT[, length(X), by=name]
name V1
1: Blake 6
2: Alexander 6
3: Leigha 8
4: Gracelline 3
5: Epifanio 7
6: Keasha 6
7: Robyn 6
8: Anna 2
9: Aesha 4
10: Michael 5
如何以自动方式执行此操作(真实数据集要大得多)?
修改
是的,它是重复的。 :( 最后提出的方法是最快的方法。
> system.time(testDT[, .SD[.N>=6], by = name])
user system elapsed
0.293 0.227 0.517
> system.time(testDT[testDT[, .I[.N>=6], by = name]$V1])
user system elapsed
0.163 0.243 0.415
> system.time(testDT[,if(.N>=6) .SD , by = name])
user system elapsed
0.073 0.323 0.399
答案 0 :(得分:1)
我们按名称&#39;分组,得到nrow(.N
),if
大于6,我们Subset the Data.table(.SD
)。
testDT[,if(.N>=6) .SD , by = name]
# name date Y X
# 1: Blake 2010-01-01 9.820801 3.69913070
# 2: Blake 2012-01-01 9.935413 15.18999375
# 3: Blake 2013-01-01 6.862176 3.37928004
# 4: Blake 2014-01-01 13.273733 21.55350503
# 5: Blake 2015-01-01 11.684667 6.27958576
# 6: Blake 2017-01-01 6.079436 7.49653718
# 7: Alexander 2010-01-01 13.209463 4.62301612
# 8: Alexander 2012-01-01 12.829328 2.00994816
# 9: Alexander 2013-01-01 10.530363 2.66907192
#10: Alexander 2016-01-01 5.233312 0.78339246
#11: Alexander 2017-01-01 9.772301 12.60278297
#12: Alexander 2019-01-01 11.927316 7.34551569
#13: Leigha 2010-01-01 9.776196 4.99655334
#14: Leigha 2011-01-01 13.612095 11.56789854
#15: Leigha 2013-01-01 7.447973 5.33016929
#16: Leigha 2014-01-01 5.706790 4.40388912
#17: Leigha 2016-01-01 8.162717 12.87081025
#18: Leigha 2017-01-01 10.186343 12.44362354
#19: Leigha 2018-01-01 11.620051 8.30192285
#20: Leigha 2019-01-01 9.068302 16.28150109
#21: Epifanio 2010-01-01 8.390729 17.90558542
#22: Epifanio 2011-01-01 13.394404 8.45036728
#23: Epifanio 2012-01-01 8.466835 10.19156807
#24: Epifanio 2013-01-01 8.337749 5.45766822
#25: Epifanio 2014-01-01 9.763512 17.13958472
#26: Epifanio 2017-01-01 8.899895 14.89054015
#27: Epifanio 2019-01-01 14.606180 0.13357331
#28: Keasha 2013-01-01 8.253522 6.44769498
#29: Keasha 2014-01-01 12.570871 0.40402566
#30: Keasha 2016-01-01 12.111212 14.08734943
#31: Keasha 2017-01-01 6.216919 0.06878532
#32: Keasha 2018-01-01 7.454885 0.38399123
#33: Keasha 2019-01-01 6.433044 1.09828333
#34: Robyn 2010-01-01 7.396294 8.41399676
#35: Robyn 2011-01-01 5.589344 1.33792036
#36: Robyn 2012-01-01 11.422883 1.66129246
#37: Robyn 2015-01-01 12.973088 2.54144396
#38: Robyn 2017-01-01 9.100841 6.78346573
#39: Robyn 2019-01-01 11.049333 4.75902075
或者代替if
,我们可以直接使用.N>1
并使用`.SD
testDT[, .SD[.N>=6], by = name]
它可能有点慢,所以另一个选项是.I
来获取行索引然后是子集
testDT[testDT[, .I[.N>=6], by = name]$V1]