R: dynamically remove NA rows from a data frame and record which field was NA

Time: 2016-02-26 17:29:41

Tags: r filter dataframe

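I have a data frame with 68 columns. I want to dynamically check for invalid data based on a vector of variable names sorted by priority. If any of the specified fields is NA, I want those rows moved to a new data frame that has an additional column containing the reason for exclusion.

Example data frame (only 5 columns):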

df1=data.frame(id=c(1:6),
   dob=as.Date(c("1/1/2001","2/2/2002",NA,"3/3/2003","1/1/1999",NA),"%m/%d/%Y"),
   sex=c("F","F","M",NA,NA,"M"),
   race=c("HA","HA","W","AA",NA,NA),
   survey=c("1",NA,NA,NA,"1","0"))

I want to be able to define required_cols dynamically. If required_cols is:

required_cols<-c("sex","race")

I would like all rows of df1 where sex or race is NA to be moved to an output table that looks like this:

 id dob        sex  race survey reason
 4  2003-03-03 <NA>   AA   <NA> sex
 5  1999-01-01 <NA> <NA>      1 sex
 6       <NA>    M  <NA>      0 race

and the original table updated to:

  id dob          sex race survey
  1  2001-01-01   F   HA      1
  2  2002-02-02   F   HA   <NA>
  3       <NA>    M   W    <NA>

If required_cols is required_cols<-c("sex","survey"), I would like the output table to be:

  id        dob  sex race survey reason
2  2 2002-02-02    F   HA   <NA> survey
3  3       <NA>    M    W   <NA> survey
4  4 2003-03-03 <NA>   AA   <NA> survey
5  5 1999-01-01 <NA> <NA>      1 sex

and the original table:

  id        dob sex race survey
1  1 2001-01-01   F   HA      1
6  6       <NA>   M <NA>      0

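I can use complete.cases to update the original table, but I could use some guidance on how to programmatically move the unused cases to a new table with a "reason" code.

Thanks in advance! I'm new to R and Stack Overflow, so if you have suggestions on how to improve my question, please let me know.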
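For context, the complete.cases step mentioned in the question can be restricted to the required columns only. A minimal sketch, assuming the df1 and required_cols from the question; the names keep, df1_kept and df1_removed are illustrative and not from the original post:

```r
df1 <- data.frame(id = 1:6,
  dob = as.Date(c("1/1/2001", "2/2/2002", NA, "3/3/2003", "1/1/1999", NA), "%m/%d/%Y"),
  sex = c("F", "F", "M", NA, NA, "M"),
  race = c("HA", "HA", "W", "AA", NA, NA),
  survey = c("1", NA, NA, NA, "1", "0"))
required_cols <- c("sex", "race")

# TRUE for rows that have no NA in any of the required columns;
# drop = FALSE keeps a data frame even when only one column is required
keep <- complete.cases(df1[, required_cols, drop = FALSE])

df1_kept    <- df1[keep, ]   # rows that pass the check
df1_removed <- df1[!keep, ]  # rows to move to the exclusion table
```

This splits the rows, but it does not yet say which column caused each exclusion.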

2 answers:

Answer 0: (score: 2)

Use apply to check row by row whether any required entry is NA, then subset the rows accordingly:

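required_cols<-c("sex","race")
# logical matrix: TRUE where a required column is NA
# (drop=FALSE keeps a matrix even with a single required column)
na_mat<-is.na(df1[,required_cols,drop=FALSE])
has_na<-apply(na_mat,1,any)
df1_with_NA<-df1[has_na,]
df1_without_NA<-df1[!has_na,]
# comma-separated names of the required columns that are NA in each row
df1_with_NA$reason<-apply(na_mat[has_na,,drop=FALSE],1,function(x){
       paste(required_cols[x],collapse=",") })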

Check the output:

> df1_with_NA
  id        dob  sex race survey   reason
4  4 2003-03-03 <NA>   AA   <NA>      sex
5  5 1999-01-01 <NA> <NA>      1 sex,race
6  6       <NA>    M <NA>      0     race

> df1_without_NA
  id        dob sex race survey
1  1 2001-01-01   F   HA      1
2  2 2002-02-02   F   HA   <NA>
3  3       <NA>   M    W   <NA>

If needed, you can update the original table with df1<-df1_without_NA.
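The approach above lists every missing column in reason. If only the highest-priority missing column is wanted, as in the first expected table of the question, one option is to fill reason from the lowest priority to the highest, so that higher-priority columns overwrite lower ones. A minimal sketch; split_by_required is a hypothetical helper name, not something from the original post:

```r
df1 <- data.frame(id = 1:6,
  dob = as.Date(c("1/1/2001", "2/2/2002", NA, "3/3/2003", "1/1/1999", NA), "%m/%d/%Y"),
  sex = c("F", "F", "M", NA, NA, "M"),
  race = c("HA", "HA", "W", "AA", NA, NA),
  survey = c("1", NA, NA, NA, "1", "0"))

# Hypothetical helper: required_cols is assumed to be ordered by priority,
# highest first. Returns the kept rows and the excluded rows separately.
split_by_required <- function(df, required_cols) {
  reason <- rep(NA_character_, nrow(df))
  # walk from lowest to highest priority so the highest-priority
  # missing column is the one left in 'reason'
  for (col in rev(required_cols)) {
    reason[is.na(df[[col]])] <- col
  }
  has_na <- !is.na(reason)
  excluded <- df[has_na, , drop = FALSE]
  excluded$reason <- reason[has_na]
  list(kept = df[!has_na, , drop = FALSE], excluded = excluded)
}

res <- split_by_required(df1, c("sex", "race"))
res$excluded  # rows 4, 5, 6 with one reason each
res$kept      # rows 1, 2, 3
```

Because the loop only touches the columns named in required_cols, the same function works unchanged for any other priority vector.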

Answer 1: (score: 0)

One way to do this is to loop over the data frame and use chained if statements with is.na() to identify which rows contain NA.

df1=data.frame(id=c(1:6),
               dob=as.Date(c("1/1/2001","2/2/2002",NA,"3/3/2003","1/1/1999",NA),"%m/%d/%Y"),
               sex=c("F","F","M",NA,NA,"M"),
               race=c("HA","HA","W","AA",NA,NA),
               survey=c("1",NA,NA,NA,"1","0"))

# start with reason = NA for every row, then fill it in row by row
df1$reason = NA_character_
for(i in seq_len(nrow(df1))){
  if(is.na(df1$sex[i]) && is.na(df1$race[i])){
    df1$reason[i] = 'sex & race'
  }else if(is.na(df1$sex[i])){
    df1$reason[i] = 'sex'
  }else if(is.na(df1$race[i])){
    df1$reason[i] = 'race'
  }
}
df1
# then subset the new df1 where reason is not NA to get the deleted rows
# (comparing with == NA always yields NA, so use is.na() instead)
df2 = subset(df1, !is.na(reason))

This is a brute-force approach, but it works:

  id        dob  sex race survey     reason
1  1 2001-01-01    F   HA      1       <NA>
2  2 2002-02-02    F   HA   <NA>       <NA>
3  3       <NA>    M    W   <NA>       <NA>
4  4 2003-03-03 <NA>   AA   <NA>        sex
5  5 1999-01-01 <NA> <NA>      1 sex & race
6  6       <NA>    M <NA>      0       race