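This is slightly different from the question "Deleting random subset of observations within a group of variables that have a certain value". The variation I am looking for is how to delete a subset of rows where the number of rows deleted changes each time the grouping criterion changes. Here is a simple example data set with one column of numeric values and one numeric grouping column (the grouping column could also be a factor such as "AA1", "AA2", etc.).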
set.seed(23)
df<-data.frame(a=round(rnorm(500,mean=20,sd=2)))
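df$group<-seq(from = 1, to = length (df),by=5) ##note: length(df) is the number of columns (1), so this sets group to 1 in every row; the code below groups on column a instead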
Tabulating the data with table(df$a) gives this result:
group: 14 15 16 17 18 19 20 21 22 23 24 25
count: 1 7 13 24 65 87 91 91 59 42 12 8
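For example: when the grouping value equals 15, I want to randomly delete 4 rows; when group = 16, randomly delete 3 rows; when group = 17, randomly delete 7 rows. This process continues for each value of the grouping variable.
Here is my current solution: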
(dfindex<-which(df$a==15)) ##create index that meets the grouping variable criteria
(delete.df.index<-sample(dfindex,4)) ##select number of rows to randomly remove
dfnew<-df[-delete.df.index,] ##create a new data frame and delete the randomly selected rows
Repeat the steps above on the newly created data frame:
(dfindex<-which(dfnew$a==16)) ##create another index from the grouping variable criteria
(delete.df.index<-sample(dfindex,3)) ##select rows to randomly delete
dfnew<-dfnew[-delete.df.index,] ##delete rows
Repeat for every combination of grouping value and number of randomly selected rows to delete.
(dfindex<-which(dfnew$a==17))
(delete.df.index<-sample(dfindex,7))
dfnew<-dfnew[-delete.df.index,]
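In this example I have 12 grouping levels. The simple but time-consuming approach is to copy, paste, and edit the code for every combination of grouping value and number of rows to delete. I would like to know whether a table (or something similar) can be used to specify the grouping values and the number of rows to delete for each one:
Example table of groups and rows to delete.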
Group Number of rows to randomly remove
14 0
15 4
16 3
17 7
18 40
19 23
Thanks in advance for any input.
Answer 0 (score: 0)
Try running this:
set.seed(23)
df<-data.frame(a=round(rnorm(50,mean=20,sd=2)))
# create table of no of rows that need to be removed per each a
noofrowsremove <- read.table(textConnection(
'a toremove
21 1
23 2
15 2
17 1
19 2
20 2
24 2
16 1
22 1
18 3'), header = TRUE)
library(data.table)
# assign random number in a new column, this will help in sampling
df$tosample <- runif(50)
# convert data.frame to data.table, grouped operations are easier on data.table
dt <- data.table(df)
# rank the tosample column within each unique a value
dt[,samplerank := rank(tosample), by = 'a']
# merge the filtering no of rows with dt
dt <- merge(dt,noofrowsremove, by = 'a')
# filter out rows that have samplerank columns <= the no of rows that need to be removed
dttrimmed <- dt[samplerank > toremove]
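For readers who prefer to stay in base R, the same table-driven idea can be sketched as a loop over the lookup table. This is only an illustration, not part of the answer above: `remove_random_rows` and its arguments are hypothetical names, and the lookup values are taken from the example table in the question. It assumes that asking for more rows than a group contains should simply remove the whole group.

```r
# Hypothetical helper (not from the original answer): for each value of the
# grouping column listed in `lookup`, drop that many randomly chosen rows.
remove_random_rows <- function(df, lookup, group_col = "a", n_col = "toremove") {
  drop_idx <- integer(0)
  for (i in seq_len(nrow(lookup))) {
    idx <- which(df[[group_col]] == lookup[[group_col]][i])
    # never ask for more rows than the group contains (assumption: cap silently)
    n <- min(lookup[[n_col]][i], length(idx))
    if (n > 0) {
      # index with sample.int() rather than sample(idx, n): when idx has
      # length 1, sample() would draw from 1:idx instead of from idx itself
      drop_idx <- c(drop_idx, idx[sample.int(length(idx), n)])
    }
  }
  if (length(drop_idx) == 0) return(df)
  df[-drop_idx, , drop = FALSE]  # drop = FALSE keeps a one-column data frame
}

set.seed(23)
df <- data.frame(a = round(rnorm(500, mean = 20, sd = 2)))
lookup <- data.frame(a = c(14, 15, 16, 17, 18, 19),
                     toremove = c(0, 4, 3, 7, 40, 23))
dfnew <- remove_random_rows(df, lookup)
```

All indices are collected against the original `df` and removed in one step, so there is no need to rebuild the index after each group as in the copy/paste version.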
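Answer 1 (score: 0)
After working through the answer provided by Codoremifa, I noticed a few small details that may be worth documenting for others who find this post. Starting from Codoremifa's answer, I made a few small changes and included some extra code to illustrate those details. Essentially, pay attention to the merge step and decide how to handle the NA values that it produces.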
set.seed(23)
df<-data.frame(a=round(rnorm(50,mean=20,sd=2)))
# create table of no of rows that need to be removed per each a
noofrowsremove <- read.table(textConnection(
'a toremove
21 0
17 1
19 2
20 2
24 2
16 1
22 1
18 3'), header = TRUE)
## values 23 and 15 are deliberately left out of noofrowsremove to illustrate the merge behavior below
## value 21 is kept and assigned 0 (i.e., do not remove any rows where a is 21)
library(data.table)
# assign random number in a new column, this will help in sampling
df$tosample <- runif(50) #can also use runif(nrow(df))
# convert data.frame to data.table, grouped operations are easier on data.table
dt <- data.table(df)
# rank the tosample column within each unique a value
dt[,samplerank := rank(tosample), by = 'a']
# merge the filtering no of rows with dt. Be careful with merge options.
dt1 <- merge(dt,noofrowsremove, by = 'a') #46 rows
dt2 <- merge(dt,noofrowsremove, by = 'a',all=TRUE) #51 rows.
# Notice the difference in the number of rows between dt1 and dt2
# In dt2, rows with a == 23 have NA in the toremove column because 23 was not included in noofrowsremove
nrow(dt1) #46 rows
nrow(dt2) #51 rows
## to keep the rows whose toremove is NA, change the NA to 0
dt2$toremove[is.na(dt2$toremove)] <- 0 # replace NA with 0
# filter out rows whose samplerank is <= the number of rows that need to be removed
dttrimmed1 <- dt1[samplerank > toremove] #36 rows. Values missing from noofrowsremove were already dropped by the merge
dttrimmed2 <- dt2[samplerank > toremove] #40 rows. Kept the values whose NA was reassigned to 0
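A possible variation on the merge step, offered only as a sketch: `all = TRUE` also adds a row for any value that appears in `noofrowsremove` but not in the data, whereas `all.x = TRUE` (a left join) keeps exactly the rows of the original table. The names `dt3` and `dttrimmed3` are introduced here for illustration, and `sample(.N)` is swapped in for the `runif()` plus `rank()` pair as a different way to get a random rank within each group.

```r
library(data.table)

set.seed(23)
dt <- data.table(a = round(rnorm(50, mean = 20, sd = 2)))
noofrowsremove <- data.table(a = c(21, 17, 19, 20, 24, 16, 22, 18),
                             toremove = c(0, 1, 2, 2, 2, 1, 1, 3))

# random rank within each value of a: sample(.N) is a random permutation of 1..N
dt[, samplerank := sample(.N), by = a]

# left join: every row of dt is kept, and no extra rows are added for
# lookup values that do not occur in dt
dt3 <- merge(dt, noofrowsremove, by = "a", all.x = TRUE)

# values of a that are missing from the lookup table get NA; treat them as
# "remove nothing" (assumption: that is the desired default)
dt3[is.na(toremove), toremove := 0]

# keep rows whose random rank exceeds the number of rows to remove
dttrimmed3 <- dt3[samplerank > toremove]
```

With this form, `nrow(dt3)` always equals `nrow(dt)`, so the row counts before and after trimming can be compared directly.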