Question

I have two datasets:

loc <- c("a","b","c","d","e")
id1 <- c(NA,9,3,4,5)
id2 <- c(2,3,7,5,6)
id3 <- c(2,NA,5,NA,7)
cost1 <- c(10,20,30,40,50)
cost2 <- c(50,20,30,30,50)
cost3 <- c(40,20,30,10,20)
dt <- data.frame(loc,id1,id2,id3,cost1,cost2,cost3)


id <- c(1,2,3,4,5,6,7)
rate <- c(0.9,0.8,0.7,0.6,0.5,0.4,0.3)
lookupd_tb <- data.frame(id,rate)

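What I want to do is match the values of id1, id2 and id3 in dt against id in lookupd_tb and, where there is a match, multiply the rate for that id by its related cost.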

This is my approach:

library(dplyr)

dt <- dt %>%
  left_join(lookupd_tb, by = c("id1" = "id")) %>%
  dplyr::mutate(cost1 = ifelse(!is.na(rate), cost1 * rate, cost1)) %>%
  dplyr::select(-rate)

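What I am doing now works fine, but I have to repeat it three times, once for each variable, and I was wondering whether there is a more efficient way to do this (perhaps using the apply family?).

I tried joining all three id variables with my lookup table, but when rate is joined to my dt, all of the costs (cost1, cost2 and cost3) get multiplied by the same rate, which is not what I want.

Thanks for your help!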

Answer 1

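A base R approach is to loop over the 'id' columns with sapply/lapply: get the match index of each 'id' column against the 'id' column of 'lookupd_tb', use that index to get the corresponding 'rate', replace the NA elements with 1, multiply by the 'cost' columns, and update the 'cost' columns.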

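nmid <- grep("id", names(dt))      # positions of the id columns
nmcost <- grep("cost", names(dt))  # positions of the cost columns

dt[nmcost] <- dt[nmcost] * sapply(dt[nmid], function(x) {
  x1 <- lookupd_tb$rate[match(x, lookupd_tb$id)]  # rate for each id, NA if no match
  replace(x1, is.na(x1), 1)                       # unmatched ids leave the cost unchanged
})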
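To see what the anonymous function does, here is a minimal sketch that walks through the same steps for a single column pair (id1 and cost1), using the data from the question. The intermediate names idx, rate1 and new_cost1 are only for illustration.

```r
# Data from the question (only the pieces needed for one column pair)
lookupd_tb <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7),
                         rate = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3))
id1 <- c(NA, 9, 3, 4, 5)
cost1 <- c(10, 20, 30, 40, 50)

# Position of each id in the lookup table; NA when the id is missing or unmatched
idx <- match(id1, lookupd_tb$id)          # NA NA 3 4 5

# Look up the rate by position; unmatched ids give NA
rate1 <- lookupd_tb$rate[idx]             # NA NA 0.7 0.6 0.5

# A rate of 1 leaves the cost unchanged for unmatched ids
rate1 <- replace(rate1, is.na(rate1), 1)  # 1 1 0.7 0.6 0.5

new_cost1 <- cost1 * rate1                # 10 20 21 24 25
```

Both the missing id (NA) and the id that is absent from the lookup table (9) end up with a rate of 1, so their costs pass through unchanged.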

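Or, with the tidyverse, we can loop over the two sets of columns, i.e. 'id' and 'cost', using purrr::map2 and then apply the same approach as above. The only difference is that here we create new columns instead of updating the 'cost' columns.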

library(tidyverse)
dt %>% 
   select(nmid) %>% 
   map2_df(., dt %>% 
               select(nmcost), ~  
                 .x %>% 
                     match(., lookupd_tb$id) %>%
                     lookupd_tb$rate[.] %>% 
                     replace(., is.na(.),1) * .y ) %>%
    rename_all(~ paste0("costnew", seq_along(.))) %>%
    bind_cols(dt, .)
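If you would rather stay in base R while still creating new columns, the same pairwise loop can be sketched with Map. This is an alternative, not the code from the answer above; the names id_cols, cost_cols, rate_or_one and result are made up for illustration.

```r
# Data from the question
dt <- data.frame(
  loc = c("a", "b", "c", "d", "e"),
  id1 = c(NA, 9, 3, 4, 5), id2 = c(2, 3, 7, 5, 6), id3 = c(2, NA, 5, NA, 7),
  cost1 = c(10, 20, 30, 40, 50), cost2 = c(50, 20, 30, 30, 50),
  cost3 = c(40, 20, 30, 10, 20)
)
lookupd_tb <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7),
                         rate = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3))

id_cols <- grep("^id", names(dt), value = TRUE)
cost_cols <- grep("^cost", names(dt), value = TRUE)

# Hypothetical helper: rate for each id, or 1 when the id has no match
rate_or_one <- function(ids) {
  r <- lookupd_tb$rate[match(ids, lookupd_tb$id)]
  replace(r, is.na(r), 1)
}

# Walk over the id/cost column pairs in parallel
new_costs <- Map(function(ids, costs) costs * rate_or_one(ids),
                 dt[id_cols], dt[cost_cols])
names(new_costs) <- paste0("costnew", seq_along(new_costs))

# Append the new columns to the original data
result <- cbind(dt, as.data.frame(new_costs))
```

This assumes the id and cost columns appear in the same order (id1 with cost1, and so on), which is true for the data in the question.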

Answer 2

In the tidyverse you can also try an alternative approach that reshapes the data from wide to long:

  library(tidyverse)
  dt %>% 
  # data transformation to long
  gather(k, v, -loc) %>% 
  mutate(ID=paste0("costnew", str_extract(k, "[:digit:]")),
         k=str_remove(k, "[:digit:]")) %>% 
  spread(k, v) %>% 
  # left_join and calculations of new costs
  left_join(lookupd_tb , by="id") %>% 
  mutate(cost_new=ifelse(is.na(rate), cost,rate*cost)) %>% 
  #  clean up and expected output
  select(loc, ID, cost_new) %>% 
  spread(ID, cost_new) %>% 
  left_join(dt,., by="loc")  # or %>% bind_cols(dt, .)

Output:

  loc id1 id2 id3 cost1 cost2 cost3 costnew1 costnew2 costnew3
1   a  NA   2   2    10    50    40       10       40       32
2   b   9   3  NA    20    20    20       20       14       20
3   c   3   7   5    30    30    30       21        9       15
4   d   4   5  NA    40    30    10       24       15       10
5   e   5   6   7    50    50    20       25       20        6

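The idea is to bring the data into a long format suitable for the left_join, using a gather and spread combination with the new index columns k and ID. After the calculation, a second spread transforms the result into the expected output, which is then joined back to dt.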
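The same wide-to-long idea can also be sketched with tidyr's pivot_longer and pivot_wider, which supersede gather and spread. This is not the code from the answer; it assumes tidyr 1.0.0 or later, and the names n, costnew, new_costs and result are chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Data from the question
dt <- data.frame(
  loc = c("a", "b", "c", "d", "e"),
  id1 = c(NA, 9, 3, 4, 5), id2 = c(2, 3, 7, 5, 6), id3 = c(2, NA, 5, NA, 7),
  cost1 = c(10, 20, 30, 40, 50), cost2 = c(50, 20, 30, 30, 50),
  cost3 = c(40, 20, 30, 10, 20)
)
lookupd_tb <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7),
                         rate = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3))

new_costs <- dt %>%
  # Split names such as "id1" into a value column ("id") and a suffix ("1"),
  # giving one row per loc and suffix with columns id and cost
  pivot_longer(cols = -loc,
               names_to = c(".value", "n"),
               names_pattern = "([a-z]+)([0-9]+)") %>%
  # A single join now covers all three id columns
  left_join(lookupd_tb, by = "id") %>%
  # Keep the original cost when no rate was found
  mutate(costnew = if_else(is.na(rate), cost, rate * cost)) %>%
  select(loc, n, costnew) %>%
  # Back to wide: costnew1, costnew2, costnew3
  pivot_wider(names_from = n, values_from = costnew, names_prefix = "costnew")

result <- left_join(dt, new_costs, by = "loc")
```

Using ".value" in names_to lets one pivot produce the paired id and cost columns directly, so the intermediate spread step is not needed.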
