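I have a data frame with 68 columns. I want to dynamically check for invalid data based on a priority-ordered vector of variable names. If any of the specified fields is NA, I want to move those rows to a new data frame that has an additional column holding the reason for exclusion.
Example data frame (only 5 columns):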
df1=data.frame(id=c(1:6),
dob=as.Date(c("1/1/2001","2/2/2002",NA,"3/3/2003","1/1/1999",NA),"%m/%d/%Y"),
sex=c("F","F","M",NA,NA,"M"),
race=c("HA","HA","W","AA",NA,NA),
survey=c("1",NA,NA,NA,"1","0"))
I want to be able to define required_cols dynamically. If required_cols is:
required_cols<-c("sex","race")
I want all rows of df1 where sex or race is NA to be moved to an output table that looks like this:
id dob sex race survey reason
4 2003-03-03 <NA> AA <NA> sex
5 1999-01-01 <NA> <NA> 1 sex
6 <NA> M <NA> 0 race
and the original table updated to:
id dob sex race survey
1 2001-01-01 F HA 1
2 2002-02-02 F HA <NA>
3 <NA> M W <NA>
If required_cols is required_cols<-c("sex","survey"), I want the output table to be:
id dob sex race survey reason
2 2 2002-02-02 F HA <NA> survey
3 3 <NA> M W <NA> survey
4 4 2003-03-03 <NA> AA <NA> survey
5 5 1999-01-01 <NA> <NA> 1 sex
and the original table:
id dob sex race survey
1 1 2001-01-01 F HA 1
6 6 <NA> M <NA> 0
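I can use complete.cases to update the original table, but I could use some guidance on how to programmatically move the excluded cases to a new table together with a "reason" code.
Thanks in advance! I'm new to R and Stack Overflow, so if you have suggestions on how to improve my question, please let me know.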
Answer 0: (score: 2)
Use apply to check row by row whether any entry is NA, then remove (or subset) those rows:
required_cols<-c("sex","race")
# TRUE for every row that has an NA in at least one required column
has_na<-apply(is.na(df1[,required_cols,drop=FALSE]),1,any)
df1_with_NA<-df1[has_na,]
df1_without_NA<-df1[!has_na,]
# list all required columns that are NA in each excluded row, comma-separated
df1_with_NA$reason<-apply(is.na(df1_with_NA[,required_cols,drop=FALSE]),1,
function(x) paste(required_cols[x],collapse=","))
Check the output:
> df1_with_NA
id dob sex race survey reason
4 4 2003-03-03 <NA> AA <NA> sex
5 5 1999-01-01 <NA> <NA> 1 sex,race
6 6 <NA> M <NA> 0 race
> df1_without_NA
id dob sex race survey
1 1 2001-01-01 F HA 1
2 2 2002-02-02 F HA <NA>
3 3 <NA> M W <NA>
If needed, you can update the original table with df1<-df1_without_NA.
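The answer above reports every missing required column for a row (for example sex,race for row 5), whereas the question's first example shows a single reason per row. A minimal sketch of a priority-based variant follows. It assumes that the earliest column in required_cols wins when several are NA; the helper name split_by_required is hypothetical and not part of the original answer. Note that the question's second example labels row 4 as survey, which this rule would label sex, so the ordering may need adjusting if a different priority is intended.

```r
# Example data from the question
df1 <- data.frame(
  id = c(1:6),
  dob = as.Date(c("1/1/2001", "2/2/2002", NA, "3/3/2003", "1/1/1999", NA), "%m/%d/%Y"),
  sex = c("F", "F", "M", NA, NA, "M"),
  race = c("HA", "HA", "W", "AA", NA, NA),
  survey = c("1", NA, NA, NA, "1", "0")
)

# Hypothetical helper: split df into kept and excluded rows, recording the
# highest-priority required column that is NA (assumption: first listed wins).
split_by_required <- function(df, required_cols) {
  reason <- rep(NA_character_, nrow(df))
  # Walk from lowest to highest priority so that higher-priority
  # columns overwrite lower-priority ones.
  for (col in rev(required_cols)) {
    reason[is.na(df[[col]])] <- col
  }
  excluded <- df[!is.na(reason), , drop = FALSE]
  excluded$reason <- reason[!is.na(reason)]
  list(kept = df[is.na(reason), , drop = FALSE], excluded = excluded)
}

res <- split_by_required(df1, c("sex", "race"))
res$excluded  # rows 4, 5, 6 with reasons sex, sex, race
res$kept      # rows 1, 2, 3
```

Because required_cols is an ordinary character vector, the same call should work unchanged for any subset of the 68 columns.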
Answer 1: (score: 0)
One way to do this is to loop over the data frame and use chained if statements with is.na() to identify which rows contain NA values.
df1=data.frame(id=c(1:6),
dob=as.Date(c("1/1/2001","2/2/2002",NA,"3/3/2003","1/1/1999",NA),"%m/%d/%Y"),
sex=c("F","F","M",NA,NA,"M"),
race=c("HA","HA","W","AA",NA,NA),
survey=c("1",NA,NA,NA,"1","0"))
# create the reason column up front so the loop only fills it in
df1$reason = NA_character_
for(i in seq_len(nrow(df1))){
if(is.na(df1$sex[i]) && is.na(df1$race[i])){
df1$reason[i] = 'sex & race'
}else if(is.na(df1$sex[i])){
df1$reason[i] = 'sex'
}else if(is.na(df1$race[i])){
df1$reason[i] = 'race'
}else{
df1$reason[i] = NA
}
}
df1
# then subset df1 where reason is not NA to get the excluded rows
# (comparing with == NA always yields NA, so use is.na() instead)
df2 = subset(df1, !is.na(reason))
This is a brute-force approach, but it works:
id dob sex race survey reason
1 1 2001-01-01 F HA 1 <NA>
2 2 2002-02-02 F HA <NA> <NA>
3 3 <NA> M W <NA> <NA>
4 4 2003-03-03 <NA> AA <NA> sex
5 5 1999-01-01 <NA> <NA> 1 sex & race
6 6 <NA> M <NA> 0 race
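The loop above hard-codes sex and race, while the question asks for required_cols to be defined dynamically. The following is a rough sketch of the same idea without the if chain, assuming the " & " separator used in this answer; the variable names na_mat and df2 are illustrative only.

```r
# Example data from the question
df1 <- data.frame(
  id = c(1:6),
  dob = as.Date(c("1/1/2001", "2/2/2002", NA, "3/3/2003", "1/1/1999", NA), "%m/%d/%Y"),
  sex = c("F", "F", "M", NA, NA, "M"),
  race = c("HA", "HA", "W", "AA", NA, NA),
  survey = c("1", NA, NA, NA, "1", "0")
)

required_cols <- c("sex", "race")

# Logical matrix: one row per record, one column per required field
na_mat <- is.na(df1[, required_cols, drop = FALSE])

# Join the names of all missing required fields, or NA if none are missing
df1$reason <- apply(na_mat, 1, function(x) {
  if (any(x)) paste(required_cols[x], collapse = " & ") else NA_character_
})

# Excluded rows are those with a non-missing reason; note that
# reason == NA would never be TRUE, so is.na() is used instead
df2 <- df1[!is.na(df1$reason), ]
df2
```

Changing required_cols is then the only edit needed to check a different set of fields.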