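After completing a few basic courses in R, I have been trying to complete my first real task in R.
I have a data frame with about a million records (say DATA) and another data frame with about 100 records (say LOOKUP).
For each record in DATA, I need to apply the logic stored in LOOKUP and add a new column (say FOUND) with the value YES / NO.
Please see the data frames below, with some sample data: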
> dataf <- data.frame(stringsAsFactors = FALSE, year=c(1980,1982,1985,1981,1970),name=c("abc","def","abc","klm","nop"),id=c("123bb23","234ab23","345bc23","123bc15","124bc45"))
> lookup <- data.frame(stringsAsFactors = FALSE, year=c(1980,1981,1982),name=c("abc","klm","nop"),digit=c(5,5,4),letter=c("b","c","b"))
> dataf
  year name      id
1 1980  abc 123bb23
2 1982  def 234ab23
3 1985  abc 345bc23
4 1981  klm 123bc15
5 1970  nop 124bc45
> lookup
  year name digit letter
1 1980  abc     5      b
2 1981  klm     5      c
3 1982  nop     4      b
I need the output to look like this:
  year name      id found
1 1980  abc 123bb23   YES
2 1982  def 234ab23    NO
3 1985  abc 345bc23    NO
4 1981  klm 123bc15   YES
5 1970  nop 124bc45    NO
My function:
#hybrid FUNCTION
hybridfun <- function(df, lukup){
  for (j in 1:nrow(df)){
    df$found = "NO"
    for (i in 1:nrow(lukup)){
      if (df[[1]][[j]] == lukup[[1]][[i]])
        if (df[[2]][[j]] == lukup[[2]][[i]])
          if (substring(df[[3]][[j]], lukup[[3]][[i]], lukup[[3]][[i]]) == lukup[[4]][[i]]){
            df$found = "YES"
            break
          }
    }
  }
}
I am calling the function as follows:
hybridfun(dataf, lookup)
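It seems to be doing something, but the output does not show up as I expected.
Could someone please help? If you think you need any further information, let me know and I will edit my post.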
Answer 0 (score: 1)
library(dplyr)
library(stringr)  # provides str_sub()
dataf2 <- dataf %>%
  # Rows without a (year, name) match get NA for digit and letter
  left_join(lookup, by = c("year", "name")) %>%
  mutate(found = case_when(
    str_sub(id, start = digit, end = digit) == letter ~ "YES",
    TRUE ~ "NO"
  )) %>%
  select(-digit, -letter)
dataf2
#   year name      id found
# 1 1980  abc 123bb23   YES
# 2 1982  def 234ab23    NO
# 3 1985  abc 345bc23    NO
# 4 1981  klm 123bc15   YES
# 5 1970  nop 124bc45    NO
We can also turn it into a function.
hybridfun <- function(dataf, lookup){
  dataf2 <- dataf %>%
    dplyr::left_join(lookup, by = c("year", "name")) %>%
    dplyr::mutate(found = dplyr::case_when(
      stringr::str_sub(id, start = digit, end = digit) == letter ~ "YES",
      TRUE ~ "NO"
    )) %>%
    dplyr::select(-digit, -letter)
  return(dataf2)
}
hybridfun(dataf, lookup)
#   year name      id found
# 1 1980  abc 123bb23   YES
# 2 1982  def 234ab23    NO
# 3 1985  abc 345bc23    NO
# 4 1981  klm 123bc15   YES
# 5 1970  nop 124bc45    NO
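As for why the loop in the question does not produce the expected output, three things appear to go wrong: `df$found = "NO"` sits inside the outer loop, so it resets the whole column on every iteration; `df$found = "YES"` assigns to every row rather than to row `j`; and the function never returns `df`, so the modified copy is discarded. Below is a minimal sketch of a corrected loop, assuming the sample `dataf` and `lookup` from the question. The name `hybridfun_loop` is made up for illustration.

```r
# Sample data copied from the question
dataf <- data.frame(stringsAsFactors = FALSE,
                    year = c(1980, 1982, 1985, 1981, 1970),
                    name = c("abc", "def", "abc", "klm", "nop"),
                    id = c("123bb23", "234ab23", "345bc23", "123bc15", "124bc45"))
lookup <- data.frame(stringsAsFactors = FALSE,
                     year = c(1980, 1981, 1982),
                     name = c("abc", "klm", "nop"),
                     digit = c(5, 5, 4),
                     letter = c("b", "c", "b"))

# hybridfun_loop is a hypothetical name for the corrected loop
hybridfun_loop <- function(df, lukup) {
  df$found <- "NO"                      # initialise once, before looping
  for (j in seq_len(nrow(df))) {
    for (i in seq_len(nrow(lukup))) {
      pos <- lukup$digit[i]
      if (df$year[j] == lukup$year[i] &&
          df$name[j] == lukup$name[i] &&
          substring(df$id[j], pos, pos) == lukup$letter[i]) {
        df$found[j] <- "YES"            # update row j only
        break
      }
    }
  }
  df                                    # return the modified copy
}

result <- hybridfun_loop(dataf, lookup)
print(result)
```

The result has to be assigned or printed by the caller, because R functions work on a copy of their arguments. With about a million rows, the nested loop is likely to be much slower than the join above, so this version is mainly useful for seeing what the fix looks like.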
Answer 1 (score: 1)
x <- do.call(paste, cbind(dataf[1:2], substring(dataf$id, lookup$digit, lookup$digit)))
y <- do.call(paste, lookup[-3])
dataf$found <- ifelse(x %in% y, "YES", "NO")
dataf
  year name      id found
1 1980  abc 123bb23   YES
2 1982  def 234ab23    NO
3 1985  abc 345bc23    NO
4 1981  klm 123bc15   YES
5 1970  nop 124bc45    NO
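One caveat: the `substring()` call in the snippet above is given `lookup$digit` directly, so R recycles those three positions across the rows of `dataf` by row order rather than by matching `year` and `name`. It gives the expected result on the sample data, but that may not hold for other data. The following is a sketch of a base R variant that first looks up the position for each row with `match()`. It assumes each (year, name) pair appears at most once in `lookup`, since `match()` only returns the first hit; the variable names are illustrative.

```r
# Sample data copied from the question
dataf <- data.frame(stringsAsFactors = FALSE,
                    year = c(1980, 1982, 1985, 1981, 1970),
                    name = c("abc", "def", "abc", "klm", "nop"),
                    id = c("123bb23", "234ab23", "345bc23", "123bc15", "124bc45"))
lookup <- data.frame(stringsAsFactors = FALSE,
                     year = c(1980, 1981, 1982),
                     name = c("abc", "klm", "nop"),
                     digit = c(5, 5, 4),
                     letter = c("b", "c", "b"))

# Build a combined key and find the lookup row for each data row (NA if none)
key_data <- paste(dataf$year, dataf$name)
key_lookup <- paste(lookup$year, lookup$name)
idx <- match(key_data, key_lookup)

# Only rows with a lookup entry can be "YES"
matched <- which(!is.na(idx))
pos <- lookup$digit[idx[matched]]
ch <- substring(dataf$id[matched], pos, pos)   # character at the row's own position
hit <- matched[ch == lookup$letter[idx[matched]]]

dataf$found <- "NO"
dataf$found[hit] <- "YES"
print(dataf)
```

Everything here is vectorised, so it should scale reasonably to a million rows. If `lookup` can hold several rules for the same year and name, a join-based approach would be needed instead.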