Recursively apply grep based on values from multiple columns

Time: 2017-02-03 04:52:15

Tags: r

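I would like to grep values from multiple columns and then assign a priority when there is a conflict. I was able to write working code, but it is too repetitive because I am not making full use of R's vectorized operations. I am looking for a solution using lapply, sapply, and so on.

I tried tinkering with sapply, but got stuck.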

Here is my data:

dput(DF)
structure(list(S6 = c("FED AIR FORCE", "FED AIR FORCE", "FED AIR FORCE", 
"FED MARINES", "FED MARINES", "FED MARINES", "FED NAVY", "FED NAVY", 
"FED NAVY", "FED NAVY", "FEDERAL", "STATE", "STATE"), S.Name = c("MARINE", 
"ARMY", "AIR FORCE", "MARINE", "ARMY", "AIR FORCE", "MARINE", 
"ARMY", "AIR FORCE", "NAVY", "NAVY", "AIR FORCE", "FEDERAL"), 
    Dept = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 
    10, 11, 12, 12)), .Names = c("S6", "S.Name", "Dept", "Number"
), row.names = c(NA, 13L), class = "data.frame")

Here is my code:

Divisions <- c("Air Force", "Army", "Navy", "Marine")
DF[grep("AIR FORCE", DF$S.Name, ignore.case = TRUE), "Dept"] <- "Air Force"
DF[grep("Army", DF$S.Name, ignore.case = TRUE), "Dept"] <- "Army"
DF[grep("Navy", DF$S.Name, ignore.case = TRUE), "Dept"] <- "Navy"
DF[grep("Marine", DF$S.Name, ignore.case = TRUE), "Dept"] <- "Marine"
DF[grep("AIR FORCE", DF$S6, ignore.case = TRUE), "Dept"] <- "Air Force"
DF[grep("Army", DF$S6, ignore.case = TRUE), "Dept"] <- "Army"
DF[grep("Navy", DF$S6, ignore.case = TRUE), "Dept"] <- "Navy"
DF[grep("Marine", DF$S6, ignore.case = TRUE), "Dept"] <- "Marine"

Comments: First I read S.Name, and if there is a match, I write the match to Dept. Then I read S6, and if there is a match, I overwrite it. Thus, S6 has higher priority than S.Name.

Expected output after running the above:

dput(DF)
structure(list(S6 = c("FED AIR FORCE", "FED AIR FORCE", "FED AIR FORCE", 
"FED MARINES", "FED MARINES", "FED MARINES", "FED NAVY", "FED NAVY", 
"FED NAVY", "FED NAVY", "FEDERAL", "STATE", "STATE"), S.Name = c("MARINE", 
"ARMY", "AIR FORCE", "MARINE", "ARMY", "AIR FORCE", "MARINE", 
"ARMY", "AIR FORCE", "NAVY", "NAVY", "AIR FORCE", "FEDERAL"), 
    Dept = c("Air Force", "Air Force", "Air Force", "Marine", 
    "Marine", "Marine", "Navy", "Navy", "Navy", "Navy", "Navy", 
    "Air Force", NA), Number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 
    10, 11, 12, 12)), .Names = c("S6", "S.Name", "Dept", "Number"
), row.names = c(NA, 13L), class = "data.frame")

I would like to be able to vectorize this, i.e. using lapply.

I made an attempt with lapply. Unfortunately, I don't know how to assign priority and feed the values back into the data frame. I would appreciate any ideas.

1 answer:

Answer 0 (score: 1)

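We can use str_extract to extract the specific words from the 'S6' column, building the pattern by pasting the upper-cased elements of the 'Divisions' vector together with "|". We then use gsub to convert the result to title case (first letter of each word capitalized): the pattern matches a capital letter at a word boundary (\\b[A-Z]) followed by one or more capital letters ([A-Z]+), each captured as a group. In the replacement, we keep the first backreference (\\1) as it is and convert the second backreference (\\2) to lower case with \\L, which requires perl = TRUE.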

library(stringr)
DF$Dept <- gsub("(\\b[A-Z])([A-Z]+)", "\\1\\L\\2", str_extract(DF$S6, 
         paste(toupper(Divisions), collapse="|")), perl = TRUE)
DF$Dept
#[1] "Air Force" "Air Force" "Air Force" "Marine"    "Marine"    "Marine"    "Navy"      "Navy"      "Navy"      "Navy"      NA         
#[12] NA          NA      
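To see the case conversion on its own, here is a minimal sketch that applies the same gsub call to a small made-up vector (the vector x is only an illustration, not part of the original data). It also shows that NA values pass through unchanged, which is why non-matching rows stay NA.

```r
# Illustrative input: two upper-case names and a missing value.
x <- c("AIR FORCE", "MARINE", NA)

# \\1 keeps the first letter of each word; \\L lower-cases the rest (\\2).
# The \\L escape is only honoured when perl = TRUE.
res <- gsub("(\\b[A-Z])([A-Z]+)", "\\1\\L\\2", x, perl = TRUE)
res
```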

If there are NA elements in 'Dept', apply the same method to 'S.Name' to fill them in:
i1 <- is.na(DF$Dept)
DF$Dept[i1] <- gsub("(\\b[A-Z])([A-Z]+)", "\\1\\L\\2", 
     str_extract(DF$S.Name[i1],  paste(toupper(Divisions), collapse="|")), perl = TRUE)
DF$Dept
#[1] "Air Force" "Air Force" "Air Force" "Marine"    "Marine"    "Marine"    "Navy"      "Navy"      "Navy"      "Navy"      "Navy"     
#[12] "Air Force" NA
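Since the question asked for an lapply-style solution, the two steps above could also be folded into a single pass over the columns in priority order. The following is only a sketch in base R under some assumptions: match_division and priority_cols are hypothetical names introduced here, and the helper returns the first entry of Divisions found in each string, which is not quite the same rule as str_extract (that returns the leftmost match within the string). For this data the two rules agree.

```r
# Sample data from the question (Dept starts out empty).
DF <- data.frame(
  S6 = c(rep("FED AIR FORCE", 3), rep("FED MARINES", 3), rep("FED NAVY", 4),
         "FEDERAL", "STATE", "STATE"),
  S.Name = c("MARINE", "ARMY", "AIR FORCE", "MARINE", "ARMY", "AIR FORCE",
             "MARINE", "ARMY", "AIR FORCE", "NAVY", "NAVY", "AIR FORCE",
             "FEDERAL"),
  Dept = NA_character_,
  Number = c(1:12, 12),
  stringsAsFactors = FALSE
)
Divisions <- c("Air Force", "Army", "Navy", "Marine")

# Hypothetical helper: for each element of x, return the first division
# (in the order given) whose name occurs in it, or NA if none does.
match_division <- function(x, divisions) {
  out <- rep(NA_character_, length(x))
  for (d in divisions) {
    hit <- is.na(out) & grepl(d, x, ignore.case = TRUE)
    out[hit] <- d
  }
  out
}

# Columns listed from highest to lowest priority (an assumption that
# mirrors the question: S6 wins over S.Name).
priority_cols <- c("S6", "S.Name")
matches <- lapply(DF[priority_cols], match_division, divisions = Divisions)

# Row by row, keep the first non-NA match across the columns.
DF$Dept <- Reduce(function(a, b) ifelse(is.na(a), b, a), matches)
DF$Dept
```

Adding a further column would then only mean extending priority_cols, rather than repeating the extraction step.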