Rename factor levels conditional on matching values within subsets of a data frame

Time: 2017-06-12 17:44:02

Tags: r

I am trying to assign names to the blank levels of the factor lepsp, conditional on matching values within a subset. A sample of the data:

df<- 
  plantfam        lepfam         lepsp              lepcn
  Asteraceae      Geometridae    Eois sp            green/spikes
  Asteraceae      Erebidae       Anoba sp           green/nospikes                    
  Asteraceae      Erebidae                          green/nospikes            
  Melastomaceae   Noctuidae      Balsinae sp             
  Poaceae         Erebidae       Deinopa sp         black/orangespots
  Poaceae         Erebidae                          black/orangespots
  Poaceae         Erebidae       Cocytia sp         black/yellowspots
  Poaceae                                           black/yellowspots

Here is the code for the data frame:

df<-data.frame( plantfam= c("Asteraceae","Asteraceae","Asteraceae", 
"Melastomaceae","Poaceae","Poaceae","Poaceae","Poaceae"), lepfam= 
c("Geometridae", "Erebidae","Erebidae", 
"Noctuidae","Erebidae","Erebidae","Erebidae",""), lepsp= c("Eois sp", 
"Anoba sp", "", "Balsinae sp", "Deinopa sp", "", "Cocytia sp", ""), 
lepcn= c("green/spikes","green/nospikes", "green/nospikes","", 
"black/orangespots", "black/orangespots", "black/yellowspots", 
"black/yellowspots"))

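If lepsp is blank but its lepcn matches the lepcn of another lepsp within the same plantfam, the blank lepsp should be given the name of the lepsp that matches those conditions. As a result, every lepcn subset within the same plantfam and the same lepfam is assigned the same name.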

 output<- 
    plantfam        lepfam         lepsp              lepcn
    Asteraceae      Geometridae    Eois sp            green/spikes
    Asteraceae      Erebidae       Anoba sp           green/nospikes                    
    Asteraceae      Erebidae       Anoba sp           green/nospikes            
    Melastomaceae   Noctuidae      Balsinae sp             
    Poaceae         Erebidae       Deinopa sp       black/orangespots
    Poaceae         Erebidae       Deinopa sp       black/orangespots
    Poaceae         Erebidae       Cocytia sp       black/yellowspots
    Poaceae                        Cocytia sp       black/yellowspots

I have tried several variations of the following without success: https://stackoverflow.com/a/44479195/8061255

1 Answer:

Answer 0: (score: 0)

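Straightforward base R, which makes it easy to inspect the combinations that will be renamed. Essentially, you build a unique list of the plantfam / lepfam / lepcn combinations that have a known lepsp and merge it back into the original data set: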

Read in the data and make sure the format is as expected:

df<- read.csv(text = 
'plantfam,lepfam,lepsp,lepcn
Asteraceae,Geometridae,Eois sp,green/spikes
Asteraceae,Erebidae,Anoba sp,green/nospikes
Asteraceae,Erebidae,NA,green/nospikes
Melastomaceae,Noctuidae,Balsinae sp,NA
Poaceae,Erebidae,Deinopa sp,black/orangespots
Poaceae,Erebidae,NA,black/orangespots
Poaceae,Erebidae,NA,black/yellowspots')

# assumes blanks are NA
# if blanks are actually empty strings "" then turn those into NA's

# make sure every column is a character vector, not a factor
df[] <- lapply(df, as.character)
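The comment above mentions turning empty strings into NA but does not show how. A minimal sketch of that step, using a small made-up data frame in the question's layout (the rows are illustrative only, and columns are assumed to be character):

```r
# Illustrative data with "" standing in for blanks, as in the question
df <- data.frame(
  plantfam = c("Asteraceae", "Asteraceae", "Melastomaceae"),
  lepfam   = c("Erebidae", "Erebidae", "Noctuidae"),
  lepsp    = c("Anoba sp", "", "Balsinae sp"),
  lepcn    = c("green/nospikes", "green/nospikes", ""),
  stringsAsFactors = FALSE
)

# Replace empty strings with NA in every column;
# the !is.na() guard keeps existing NA values from breaking the comparison
df[] <- lapply(df, function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})

df
```

After this, na.omit() and is.na() treat the former blanks as missing values, which is what the lookup step relies on.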

Solution:

# get a unique list of all combinations that don't have missing data
dflookup <- unique(na.omit(df))

# inspect combinations to be renamed, there should be no duplicate plantfam/lepfam/lepcn combinations
dflookup

# use the lookup to merge in all known names
# note: merge() sorts rows by the key columns by default and puts the key columns first
newdf <- merge(df, dflookup, by = c('plantfam', 'lepfam', 'lepcn'), all.x = TRUE, suffixes = c('old', 'new'))

# use original lepsp when new lepsp is NA
newdf$lepsp <- ifelse(is.na(newdf$lepspnew),newdf$lepspold,newdf$lepspnew)

# remove unneeded columns
newdf$lepspold <- newdf$lepspnew <- NULL

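# turn back into factors if desired
# (apply() returns a character matrix, and since R 4.0 as.data.frame() no longer
# converts strings to factors by default, so convert column by column instead)
newdf[] <- lapply(newdf, factor)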
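The merge above keys on plantfam, lepfam and lepcn, so a row whose lepfam is also blank (the last row of the question's data) finds no match, and merge() may reorder the rows. As a hedged alternative sketch, not the answerer's method, the blanks can be filled in place with match(). It assumes that plantfam plus lepcn alone identify a species, which is what the question's expected output implies; the helper names key, known, lookup, blank and filled are introduced here for illustration.

```r
# The question's data, with blanks stored as empty strings
df <- data.frame(
  plantfam = c("Asteraceae", "Asteraceae", "Asteraceae", "Melastomaceae",
               "Poaceae", "Poaceae", "Poaceae", "Poaceae"),
  lepfam   = c("Geometridae", "Erebidae", "Erebidae", "Noctuidae",
               "Erebidae", "Erebidae", "Erebidae", ""),
  lepsp    = c("Eois sp", "Anoba sp", "", "Balsinae sp",
               "Deinopa sp", "", "Cocytia sp", ""),
  lepcn    = c("green/spikes", "green/nospikes", "green/nospikes", "",
               "black/orangespots", "black/orangespots",
               "black/yellowspots", "black/yellowspots"),
  stringsAsFactors = FALSE
)

# Assumption: plantfam + lepcn identify a species; lepfam is ignored
key <- paste(df$plantfam, df$lepcn, sep = "|")

# Rows usable as a lookup: both lepsp and lepcn are known
known <- df$lepsp != "" & df$lepcn != ""
lookup <- unique(data.frame(key = key[known], lepsp = df$lepsp[known],
                            stringsAsFactors = FALSE))

# Each key should map to exactly one species name
stopifnot(!anyDuplicated(lookup$key))

# Fill blank lepsp values in place; rows without a match stay blank
blank <- df$lepsp == "" & df$lepcn != ""
filled <- lookup$lepsp[match(key[blank], lookup$key)]
df$lepsp[blank] <- ifelse(is.na(filled), "", filled)

df
```

Because nothing is merged, the original row order and column order are kept. If lepfam should also be part of the match, add it to the paste() call, at the cost of leaving the last row unfilled.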