Question

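I am trying to build a database in R from multiple CSVs. Each CSV contains NAs, and I want to build a master list that summarizes all of the CSVs in a single database. Here is some quick code that illustrates my problem (most of the CSVs actually have 1000 entries, and I would like to automate this process):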

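# d1 links common names to species, with one NA in each column
d1 <- data.frame(common = letters[1:5], species = paste(LETTERS[1:5], letters[1:5], sep = '.'))
d1$species[1] <- NA
d1$common[2] <- NA
# d2 links common names to ids, with one missing id
d2 <- data.frame(common = letters[1:5], id = 1:5)
d2$id[3] <- NA
# d3 links species to ids and is complete
d3 <- data.frame(species = paste(LETTERS[1:5], letters[1:5], sep = '.'), id = 1:5)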

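I have been going around in circles (writing loops), trying merge and reshape (melt/cast) without much luck, in an effort to concisely summarize the available information. This seems very basic, but I cannot find a good way to do it. Thanks in advance.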

To be clear, I am aiming for a final database like this:
  common species id
1      a     A.a  1
2      b     B.b  2
3      c     C.c  3
4      d     D.d  4
5      e     E.e  5

Answer 1

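I had a similar situation recently. The code below goes through all of the variables and returns as much information as possible to add back into the dataset. Once all of the data is there, running it one last time on the first variable gives you the result.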

# combine all into one data frame (columns missing from a source are filled with NA)
library(gtools)
library(plyr)
d <- smartbind(d1, d2, d3)

# function to get the first non-NA value of a vector (NA if there is none)
getfirstnonna <- function(x) {
  ret <- head(x[!is.na(x)], 1)
  # head() on an empty vector returns a zero-length vector, not NULL
  if (length(ret) == 0) NA else ret
}

# function to get max info based on one variable
runiteration <- function(dataset, variable) {
  e <- ddply(.data = dataset, .variables = variable,
             .fun = function(x) apply(X = x, MARGIN = 2, FUN = getfirstnonna))
  # return the above without the group whose key is NA
  e[!is.na(e[, variable]), ]
}

# run through all variables
for (i in seq_along(names(d))) {
  d <- rbind(d, runiteration(d, names(d)[i]))
}
# repeat first variable since all possible info should be available in dataset
d <- runiteration(d, names(d)[1])

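If id, species, etc. differ between the datasets, this will return the topmost non-NA value. In that case, changing the row order in d or changing the variable order can affect the result. Changing the getfirstnonna function alters this behavior (tail would pick the last value, or you could even collect all possibilities). You could also order the dataset so that the most complete records come first.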
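
As a rough sketch of those two tweaks (not part of the code above): getlastnonna and order_by_completeness are hypothetical helper names made up for illustration, and the toy data frame is invented. It uses only base R.

```r
# Hypothetical variant of getfirstnonna: keep the LAST non-NA value instead
getlastnonna <- function(x) {
  ret <- tail(x[!is.na(x)], 1)
  if (length(ret) == 0) NA else ret
}

# Hypothetical helper: put the most complete rows (most non-NA fields) first,
# so that a "first non-NA" rule prefers values from fuller records
order_by_completeness <- function(dataset) {
  n_filled <- rowSums(!is.na(dataset))
  dataset[order(-n_filled), , drop = FALSE]
}

# Made-up data for illustration only
x <- c(NA, "a", "b", NA)
last_value <- getlastnonna(x)

toy <- data.frame(common  = c(NA, "a", "a"),
                  species = c("A.a", NA, "A.a"),
                  id      = c(NA, NA, 1),
                  stringsAsFactors = FALSE)
toy_sorted <- order_by_completeness(toy)
print(last_value)
print(toy_sorted)
```

If you sort d this way before the loop, the first non-NA value in each group comes from the fullest record available, which may make the result less sensitive to the order in which the CSVs were bound together.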

Merging databases in R on multiple conditions with missing values (NAs)
