Creating a new column by applying a custom multi-argument function

Asked: 2017-01-17 14:01:30

Tags: r

Apologies if this has been asked before (I know there are similar questions here), but I have been struggling with this for hours and cannot find a solution.

Here is a sample of my data frames:

mydf1 <- structure(list(r_id = c(574111L, 291615L, 328543L),
  a_name = c("Daft Punk", "Daft Punk", "Daft Punk"),
  r_title = c("Discovery", "TRON: Legacy", "Random Access Memories")),
  .Names = c("r_id", "a_name", "r_title"),
  row.names = c(NA, 3L),
  class = "data.frame")

mydf2 <- structure(list(date_y = c(2015, 2015, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014),
  date_m = c(3, 3, 6, 5, 5, 5, 5, 5, 5, 4),
  date_d = c(28, 21, 7, 31, 24, 17, 17, 10, 3, 26),
  a_name = c("Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk", "Daft Punk"),
  r_title = c("Discovery", "Discovery", "Random Access Memories", "Random Access Memories", "Random Access Memories", "Random Access Memories", "Discovery", "Random Access Memories", "Random Access Memories", "Random Access Memories"),
  b_rank = c(110, 117, 114, 104, 95, 64, 99, 51, 63, 45),
  l_rank = c(4.52178857704904, 4.44265125649032, 4.47733681447821, 4.58496747867057, 4.67282883446191, 4.92725368515721, 4.63472898822964, 5.01727983681492, 4.93447393313069, 5.05624580534831)),
  .Names = c("date_y", "date_m", "date_d", "a_name", "r_title", "b_rank", "l_rank"),
  row.names = c(NA, -10L),
  class = "data.frame")

I would like to add a column to mydf1 containing the values returned by the following function:

myfunction1 <- function(this_a, this_r) {
  tot_w <- subset(mydf2, a_name == this_a & r_title == this_r)
  return(sum(tot_w$l_rank, na.rm = TRUE))
}

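Being new to R and still used to writing loops in VBA, my idea was that the function would take the values of a_name and r_title from mydf1 as arguments, go to mydf2, subset the matching rows (if any), and then sum the values in l_rank. The result should be: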

mydf3 <- structure(list(r_id = c(574111L, 291615L, 328543L),
  a_name = c("Daft Punk", "Daft Punk", "Daft Punk"),
  r_title = c("Discovery", "TRON: Legacy", "Random Access Memories"),
  l_rank = c("13.59917", "0.000000", "33.67039")),
  .Names = c("r_id", "a_name", "r_title", "l_rank"),
  row.names = c(NA, 3L),
  class = "data.frame")

One solution is the following:

library(dplyr)
mydf3 <- mydf1 %>%
  rowwise() %>%
  mutate(l_rank = myfunction1(a_name, r_title))

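This seems to work, but since I have to run it over a large number of rows, I suspect it is too slow. Looking at the answers to the question linked above, I tried using apply as follows: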

mydf3 <- mydf1
mydf3$l_rank <- apply(mydf1, 1,
  function(x, y) myfunction1(mydf1["a_name"], mydf1["r_title"]))

But this does not produce the expected result. I also tried using data.table like this:

mydf3 <- data.table(mydf1)
mydf3[, l_rank := myfunction1(mydf3$a_name, mydf3$r_title)]

That did not work either. I would be very grateful if someone could tell me what I am doing wrong, because this is giving me a headache.

Edit: Note that the rows in mydf1 can be duplicated.

2 Answers:

Answer 0 (score: 3)

If you want to stay with dplyr, you can use the following:

sumdf <- mydf2 %>% group_by(a_name, r_title) %>% 
  summarise(l_rank=sum(l_rank, na.rm=TRUE))

mydf1 %>% merge(sumdf, by=c('a_name','r_title'), all.x=TRUE)

I don't use your function; instead I aggregate with dplyr::summarise.

Or, as mentioned in the comments, in a single pipe:

mydf2 %>% group_by(a_name, r_title) %>% 
  summarise(l_rank=sum(l_rank, na.rm=TRUE)) %>%
  right_join(mydf1, by = c('a_name','r_title'))
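The output the question asks for has 0 rather than NA for titles with no match, and mydf1 may contain duplicate rows. A minimal sketch of one way to handle both is below, assuming dplyr 1.0 or later (for the `.groups` argument) and using small made-up stand-ins for mydf1 and mydf2 rather than the real data. It swaps in left_join and coalesce, which are not part of the answer above.

```r
library(dplyr)

# Small stand-ins for the question's data (l_rank values are made up for illustration)
mydf1 <- data.frame(
  r_id = c(574111L, 291615L, 328543L),
  a_name = "Daft Punk",
  r_title = c("Discovery", "TRON: Legacy", "Random Access Memories"),
  stringsAsFactors = FALSE
)
mydf2 <- data.frame(
  a_name = "Daft Punk",
  r_title = c("Discovery", "Discovery", "Random Access Memories"),
  l_rank = c(4.5, 4.4, 5.0),
  stringsAsFactors = FALSE
)

# Aggregate once per (a_name, r_title) pair instead of once per row of mydf1
sumdf <- mydf2 %>%
  group_by(a_name, r_title) %>%
  summarise(l_rank = sum(l_rank, na.rm = TRUE), .groups = "drop")

# left_join keeps every row of mydf1 (duplicates included) in its original order;
# coalesce replaces the NA of unmatched rows with 0
mydf3 <- mydf1 %>%
  left_join(sumdf, by = c("a_name", "r_title")) %>%
  mutate(l_rank = coalesce(l_rank, 0))
```

Because the sums are computed once per group and then looked up, this should scale better than calling the function row by row, although that has not been benchmarked here.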

Answer 1 (score: 1)

After converting the 'data.frame' to a 'data.table' (setDT), we can use a data.table join.

library(data.table)
setDT(mydf1)  # mydf1 must itself be a data.table for := to work
mydf1[, l_rank := setDT(mydf2)[mydf1, .(l_rank=sum(l_rank)),
          on = .(a_name, r_title), by = .EACHI]$l_rank]
#     r_id    a_name                r_title   l_rank
#1: 574111 Daft Punk              Discovery 13.59917
#2: 291615 Daft Punk           TRON: Legacy       NA
#3: 328543 Daft Punk Random Access Memories 33.67039
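
As for what went wrong in the question's own attempts: the apply call ignores its row argument x and passes mydf1["a_name"] and mydf1["r_title"] (entire one-column data frames) on every iteration, and the data.table call likewise hands the whole columns to the function in a single call. In both cases myfunction1 never sees one row's pair of values. A minimal base R sketch of a per-row call with mapply follows, again with small made-up stand-in data; this is an illustration, not either answerer's method.

```r
# Small stand-ins for the question's data (l_rank values are made up for illustration)
mydf1 <- data.frame(
  r_id = c(574111L, 291615L, 328543L),
  a_name = "Daft Punk",
  r_title = c("Discovery", "TRON: Legacy", "Random Access Memories"),
  stringsAsFactors = FALSE
)
mydf2 <- data.frame(
  a_name = "Daft Punk",
  r_title = c("Discovery", "Discovery", "Random Access Memories"),
  l_rank = c(4.5, 4.4, 5.0),
  stringsAsFactors = FALSE
)

# The function from the question: sum l_rank over the matching rows of mydf2
myfunction1 <- function(this_a, this_r) {
  tot_w <- subset(mydf2, a_name == this_a & r_title == this_r)
  sum(tot_w$l_rank, na.rm = TRUE)
}

# mapply walks the two columns in parallel, so each call receives
# exactly one a_name and one r_title
mydf3 <- mydf1
mydf3$l_rank <- mapply(myfunction1, mydf1$a_name, mydf1$r_title,
                       USE.NAMES = FALSE)
```

This returns 0 for unmatched titles because the sum of an empty vector is 0, but it still subsets mydf2 once per row, so the join-based answers are likely the better choice for large inputs.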