Match by ID and divide column values across two data frames

Asked: 2017-01-03 20:48:03

Tags: r dataframe dplyr

Data frames:

df 1: contains multiple rows that share the same id, with 500 value columns

    id|val.1|val.2|...|val.500
---------------------------------
    1 | 240 | 234 |...|228
    1 | 224 | 222 |...|230
    1 | 238 | 240 |...|240
    2 | 277 | 270 |...|255
    2 | 291 | 290 |...|265
    2 | 284 | 282 |...|285

df 2: contains exactly one row per unique id (matching the id column of df 1), with 500 value columns

    id|val.1|val.2|...|val.500
---------------------------------
    1 | 250 | 240 |...|245
    2 | 280 | 282 |...|281

I want to divide the column values of df 1 by the corresponding column values in df 2, matched on id, and end up with df 3:

    id|val.1|val.2|...|val.500
---------------------------------
    1 | 0.96| 0.98|...|0.93
    1 | 0.90| 0.93|...|0.94
    1 | 0.95| 1.00|...|0.98
    2 | 0.99| 0.96|...|0.91
    2 | 1.04| 1.03|...|0.94
    2 | 1.01| 1.00|...|1.01

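Basically, I want to weight the df 1 values by the df 2 values, based on id and column. I have been scratching my head over the best way to do this and have not made much progress. Any guidance would be much appreciated. Thanks.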

2 Answers:

Answer 0 (score: 3)

Two possible approaches:

1: The 'wide' approach

Using the dplyr and purrr packages:

library(dplyr)
library(purrr)

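# join df2 onto df1 by id; each df1 row gets the matching df2 values appended
df12 <- left_join(df1, df2, by = 'id')
# divide the df1 value columns (2:4) by the df2 value columns (5:7);
# these ranges fit the 3-column example data and need adjusting for more columns
cbind(id=df12[,1], map2_df(df12[,2:4], df12[,5:7], `/`))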

Using the data.table package (an approach borrowed from elsewhere):

library(data.table)

# convert to 'data.tables'
setDT(df1)
setDT(df2)

# create two vectors of matching column names
xcols = names(df1)[-1]
icols = paste0("i.", xcols)

# join and do the calculation
df1[df2, on = 'id', Map('/', mget(xcols), mget(icols)), by = .EACHI]

Both give:

   id     val.1     val.2     val.3
1:  1 0.9600000 0.9750000 0.9306122
2:  1 0.8960000 0.9250000 0.9387755
3:  1 0.9520000 1.0000000 0.9795918
4:  2 0.9892857 0.9574468 0.9074733
5:  2 1.0392857 1.0283688 0.9430605
6:  2 1.0142857 1.0000000 1.0142349
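
The column ranges 2:4 and 5:7 in the dplyr/purrr version are tied to the three-column example data. For the 500 columns in the question, the ranges can be derived from the column names instead. The following is only a sketch, written with base R's merge() rather than left_join(); the suffixes .num and .den and the names vals, df12, ratios and df3_wide are made up for illustration, and it assumes df1 and df2 share the same value column names:

```r
# example data, same values as the "Data used" block of this answer
df1 <- data.frame(id    = c(1, 1, 1, 2, 2, 2),
                  val.1 = c(240, 224, 238, 277, 291, 284),
                  val.2 = c(234, 222, 240, 270, 290, 282),
                  val.3 = c(228, 230, 240, 255, 265, 285))
df2 <- data.frame(id    = c(1, 2),
                  val.1 = c(250, 280),
                  val.2 = c(240, 282),
                  val.3 = c(245, 281))

# every column except the key is a value column
vals <- setdiff(names(df1), "id")

# join on id; the suffixes (an arbitrary choice) mark numerator and denominator columns
# note: merge() does not promise to keep the original row order of df1
df12 <- merge(df1, df2, by = "id", suffixes = c(".num", ".den"))

# divide each numerator column by its denominator column, pairing them by name
ratios <- as.data.frame(Map(`/`,
                            df12[paste0(vals, ".num")],
                            df12[paste0(vals, ".den")]))

# restore the original column names and put the id column back in front
df3_wide <- cbind(id = df12$id, setNames(ratios, vals))
```

Because the column pairs are built from vals, the same code applies unchanged when there are 500 value columns.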

2: The 'long' approach

Another option is to reshape both data frames to long format, then merge/join them, and do the calculation.

Using the data.table package:

library(data.table)

dt1 <- melt(setDT(df1), id = 1)
dt2 <- melt(setDT(df2), id = 1)

dt1[dt2, on = c('id','variable'), value := value/i.value][]

Using the dplyr and tidyr packages:

library(dplyr)
library(tidyr)

df1 %>% 
  gather(variable, value, -id) %>% 
  left_join(., df2 %>% gather(variable, value, -id), by = c('id','variable')) %>% 
  mutate(value = value.x/value.y) %>% 
  select(id, variable, value)

Both give:

    id variable     value
 1:  1    val.1 0.9600000
 2:  1    val.1 0.8960000
 3:  1    val.1 0.9520000
 4:  2    val.1 0.9892857
 5:  2    val.1 1.0392857
 6:  2    val.1 1.0142857
 7:  1    val.2 0.9750000
 8:  1    val.2 0.9250000
 9:  1    val.2 1.0000000
10:  2    val.2 0.9574468
11:  2    val.2 1.0283688
12:  2    val.2 1.0000000
13:  1    val.3 0.9306122
14:  1    val.3 0.9387755
15:  1    val.3 0.9795918
16:  2    val.3 0.9074733
17:  2    val.3 0.9430605
18:  2    val.3 1.0142349
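
The long result has one row per id/variable pair, whereas the question asks for a wide df 3. Since ids repeat in df1, reshaping back to wide needs something that identifies the original rows. A minimal sketch of one way to do this, assuming a tidyr version that provides pivot_longer() and pivot_wider() (which supersede gather() and spread()); the helper column row_id and the names df2_long and divisor are hypothetical:

```r
library(dplyr)
library(tidyr)

# example data, same values as the "Data used" block of this answer
df1 <- data.frame(id    = c(1, 1, 1, 2, 2, 2),
                  val.1 = c(240, 224, 238, 277, 291, 284),
                  val.2 = c(234, 222, 240, 270, 290, 282),
                  val.3 = c(228, 230, 240, 255, 265, 285))
df2 <- data.frame(id    = c(1, 2),
                  val.1 = c(250, 280),
                  val.2 = c(240, 282),
                  val.3 = c(245, 281))

# long version of the divisors: one row per id/variable pair
df2_long <- pivot_longer(df2, cols = -id,
                         names_to = "variable", values_to = "divisor")

df3 <- df1 %>%
  mutate(row_id = row_number()) %>%                 # remember the original row (hypothetical helper)
  pivot_longer(cols = -c(id, row_id),
               names_to = "variable", values_to = "value") %>%
  left_join(df2_long, by = c("id", "variable")) %>% # attach the matching divisor
  mutate(value = value / divisor) %>%
  select(-divisor) %>%
  pivot_wider(names_from = variable, values_from = value) %>%  # back to one row per original row
  select(-row_id)
```

Without row_id, pivot_wider() would see several values for each id/variable combination and could not rebuild the original rows.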

Data used:

df1 <- structure(list(id = c(1, 1, 1, 2, 2, 2), val.1 = c(240, 224, 238, 277, 291, 284), 
                      val.2 = c(234, 222, 240, 270, 290, 282), val.3 = c(228, 230, 240, 255, 265, 285)), 
                 .Names = c("id", "val.1", "val.2", "val.3"), class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(id = c(1, 2), val.1 = c(250, 280), val.2 = c(240, 282), val.3 = c(245, 281)),
                 .Names = c("id", "val.1", "val.2", "val.3"), class = "data.frame", row.names = c(NA, -2L))

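Answer 1 (score: 0)

As long as the columns of both data.frames are in the same order and both have the same columns, I think the following base R code will do what you want.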

cbind(df1[1], df1[-1] / df2[match(df1$id, df2$id), -1])

  id     val.1     val.2   val.500
1  1 0.9600000 0.9750000 0.9306122
2  1 0.8960000 0.9250000 0.9387755
3  1 0.9520000 1.0000000 0.9795918
4  2 0.9892857 0.9574468 0.9074733
5  2 1.0392857 1.0283688 0.9430605
6  2 1.0142857 1.0000000 1.0142349

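Here, match(df1$id, df2$id) returns, for each row of df1, the index of the row in df2 that has the same id, so df2[match(df1$id, df2$id), -1] returns the corresponding df2 rows as a data.frame with the id variable dropped. This data.frame has the same shape as df1 with its id variable dropped, so df1[-1] / df2[match(df1$id, df2$id), -1] performs the division element by element. Finally, cbind adds the id variable back to the final data.frame.

Data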
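
The same steps can be spelled out one at a time. This is only a sketch of the explanation given here: the intermediate names idx, denom and df3 are made up, and selecting the df2 columns by name (instead of with -1) is an added variation that relaxes the requirement that both data.frames have their columns in the same order:

```r
# example data, same values as the data block of this answer
df1 <- data.frame(id      = c(1L, 1L, 1L, 2L, 2L, 2L),
                  val.1   = c(240L, 224L, 238L, 277L, 291L, 284L),
                  val.2   = c(234L, 222L, 240L, 270L, 290L, 282L),
                  val.500 = c(228L, 230L, 240L, 255L, 265L, 285L))
df2 <- data.frame(id      = 1:2,
                  val.1   = c(250L, 280L),
                  val.2   = c(240L, 282L),
                  val.500 = c(245L, 281L))

# for each row of df1, the position of its id in df2
idx <- match(df1$id, df2$id)
idx
# [1] 1 1 1 2 2 2

# fail early if some id in df1 has no partner in df2
stopifnot(!anyNA(idx))

# the df2 rows lined up with df1, with columns picked by name
# so that a different column order in df2 does not matter
denom <- df2[idx, names(df1)[-1]]

# element-wise division, then put the id column back in front
df3 <- cbind(df1[1], df1[-1] / denom)
```

If an id in df1 were missing from df2, match() would return NA for that row, and the stopifnot() check would stop the script before any division takes place.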

df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), val.1 = c(240L, 
224L, 238L, 277L, 291L, 284L), val.2 = c(234L, 222L, 240L, 270L, 
290L, 282L), val.500 = c(228L, 230L, 240L, 255L, 265L, 285L)), .Names = c("id", 
"val.1", "val.2", "val.500"), class = "data.frame", row.names = c(NA, 
-6L))

df2 <- structure(list(id = 1:2, val.1 = c(250L, 280L), val.2 = c(240L, 
282L), val.500 = c(245L, 281L)), .Names = c("id", "val.1", "val.2", 
"val.500"), class = "data.frame", row.names = c(NA, -2L))