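I apologize if this has been asked before (I know there are similar questions here), but I have been struggling with this for hours and cannot find a solution.
Here is a sample of my data frames: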
mydf1 <- structure(list(r_id = c(574111L, 291615L, 328543L),
a_name = c("Daft Punk", "Daft Punk", "Daft Punk"),
r_title = c("Discovery", "TRON: Legacy", "Random Access Memories")),
.Names = c("r_id", "a_name", "r_title"),
row.names = c(NA, 3L),
class = "data.frame")
mydf2 <- structure(list(date_y = c(2015, 2015, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014),
date_m = c(3, 3, 6, 5, 5, 5, 5, 5, 5, 4),
date_d = c(28, 21, 7, 31, 24, 17, 17, 10, 3, 26),
a_name = c("Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk"),
r_title = c("Discovery", "Discovery", "Random Access Memories", "Random Access Memories", "Random Access Memories", "Random Access Memories", "Discovery", "Random Access Memories", "Random Access Memories", "Random Access Memories"),
b_rank = c(110, 117, 114, 104, 95, 64, 99, 51, 63, 45),
l_rank = c(4.52178857704904, 4.44265125649032, 4.47733681447821, 4.58496747867057, 4.67282883446191, 4.92725368515721, 4.63472898822964, 5.01727983681492, 4.93447393313069, 5.05624580534831)),
.Names = c("date_y", "date_m", "date_d", "a_name", "r_title", "b_rank", "l_rank"),
row.names = c(NA, -10L),
class = "data.frame")
I would like to add a column to mydf1 containing the values returned by the following function:
myfunction1 <- function(this_a, this_r){
tot_w <- subset(mydf2, a_name == this_a & r_title == this_r)
return(sum(tot_w$l_rank, na.rm = TRUE))}
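Being new to R and still used to writing loops in VBA, my idea was that the function takes the values of a_name and r_title from each row of mydf1 as arguments, goes to mydf2, subsets the matching rows (if any), and then sums the values in l_rank. The result should be: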
mydf3 <- structure(list(r_id = c(574111L, 291615L, 328543L),
a_name = c("Daft Punk", "Daft Punk", "Daft Punk"),
r_title = c("Discovery", "TRON: Legacy", "Random Access Memories"),
l_rank = c("13.59917", "0.000000", "33.67039")),
.Names = c("r_id", "a_name", "r_title", "l_rank"),
row.names = c(NA, 3L),
class = "data.frame")
One solution is the following:
library(dplyr)
mydf3 <- mydf1 %>%
  rowwise() %>%
  mutate(l_rank = myfunction1(a_name, r_title))
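This seems to work, but since I have to run it on a large number of rows, I suspect it will be too slow. Looking at the answers to the question linked above, I tried using apply as follows: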
mydf3 <- mydf1
mydf3$l_rank <- apply(mydf1, 1,
function(x, y) myfunction1(mydf1["a_name"], mydf1["r_title"]))
But this does not produce the expected result. I also tried using data.table like this:
mydf3 <- data.table(mydf1)
mydf3[, l_rank := myfunction1(mydf3$a_name, mydf3$r_title)]
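That did not work either. I would be very grateful if someone could tell me what I am doing wrong, because this is giving me a headache.
Edit
Please note that rows in mydf1 can be duplicated.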
Answer 0 (score: 3)
If you want to stay with dplyr, you can use the following:
sumdf <- mydf2 %>% group_by(a_name, r_title) %>%
summarise(l_rank=sum(l_rank, na.rm=TRUE))
mydf1 %>% merge(sumdf, by=c('a_name','r_title'), all.x=TRUE)
I don't use your function here; dplyr::summarise does the summing instead.
Or, as mentioned in the comments, in a single pipe:
mydf2 %>% group_by(a_name, r_title) %>%
summarise(l_rank=sum(l_rank, na.rm=TRUE)) %>%
right_join(mydf1, by = c('a_name','r_title'))
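Both variants above return NA for titles that have no match in mydf2, whereas the expected output in the question shows 0. A minimal sketch of one way to get the zero, assuming a dplyr version that supports the .groups argument of summarise (1.0 or later); the two small data frames are cut-down stand-ins for the question's data, not the real values:

```r
library(dplyr)

# Cut-down stand-ins for the question's data (same column names, made-up ranks)
mydf1 <- data.frame(
  r_id    = c(574111L, 291615L, 328543L),
  a_name  = "Daft Punk",
  r_title = c("Discovery", "TRON: Legacy", "Random Access Memories"),
  stringsAsFactors = FALSE
)
mydf2 <- data.frame(
  a_name  = "Daft Punk",
  r_title = c("Discovery", "Discovery", "Random Access Memories"),
  l_rank  = c(1.5, 2.5, 4),
  stringsAsFactors = FALSE
)

# One sum per (a_name, r_title) pair, computed once instead of once per row
sumdf <- mydf2 %>%
  group_by(a_name, r_title) %>%
  summarise(l_rank = sum(l_rank, na.rm = TRUE), .groups = "drop")

# left_join keeps every row of mydf1 (duplicates included) in its original order;
# coalesce() replaces the NA of unmatched titles with 0
mydf3 <- mydf1 %>%
  left_join(sumdf, by = c("a_name", "r_title")) %>%
  mutate(l_rank = coalesce(l_rank, 0))
```

Because the join is driven by mydf1, duplicated rows in mydf1 each receive the same sum, which seems to be what the edit to the question asks for.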
Answer 1 (score: 1)
We can use a data.table join after converting the 'data.frame' to a 'data.table' (setDT).
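library(data.table)
# := only works on a data.table, so convert both data frames by reference first
setDT(mydf1)
setDT(mydf2)
mydf1[, l_rank := mydf2[mydf1, .(l_rank = sum(l_rank)),
                        on = .(a_name, r_title), by = .EACHI]$l_rank]
mydf1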
# r_id a_name r_title l_rank
#1: 574111 Daft Punk Discovery 13.59917
#2: 291615 Daft Punk TRON: Legacy NA
#3: 328543 Daft Punk Random Access Memories 33.67039
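As a side note on why the attempts in the question fail: the apply call ignores its row argument and passes whole columns of mydf1 to myfunction1, and the data.table call invokes myfunction1 only once with the full vectors, so neither evaluates the function row by row. If you do want to keep the function, a rough base R sketch with mapply could look like this (cut-down stand-in data with made-up ranks; it is still one subset per row, so the grouped joins above are likely faster on large inputs):

```r
# Cut-down stand-ins for the question's data (same column names, made-up ranks)
mydf1 <- data.frame(
  r_id    = c(574111L, 291615L, 328543L),
  a_name  = "Daft Punk",
  r_title = c("Discovery", "TRON: Legacy", "Random Access Memories"),
  stringsAsFactors = FALSE
)
mydf2 <- data.frame(
  a_name  = "Daft Punk",
  r_title = c("Discovery", "Discovery", "Random Access Memories"),
  l_rank  = c(1.5, 2.5, 4),
  stringsAsFactors = FALSE
)

# The function from the question: sum l_rank over the matching rows of mydf2
myfunction1 <- function(this_a, this_r) {
  tot_w <- subset(mydf2, a_name == this_a & r_title == this_r)
  sum(tot_w$l_rank, na.rm = TRUE)
}

# mapply walks the two columns in parallel, so each call sees a single
# a_name and a single r_title; USE.NAMES = FALSE keeps the result unnamed
mydf3 <- mydf1
mydf3$l_rank <- mapply(myfunction1, mydf1$a_name, mydf1$r_title,
                       USE.NAMES = FALSE)
```

Unmatched titles come out as 0 here because sum() over an empty vector is 0.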