Merging multiple data.tables with duplicate column names

Date: 2015-09-11 15:21:02

Tags: r join merge duplicates data.table

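I am trying to merge (join) multiple data.tables (obtained with fread from 5 csv files) to form a single data.table. I get an error when I try to merge 5 data.tables, but it works fine when I merge only 4. MWE below: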

library(data.table)

# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)

# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))
  

Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.

# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))

    x y.x y.y y.x y.y 
 1: a  10  11  12  13 
 2: b  11  12  13  14 
 3: c  12  13  14  15 
 4: d  13  14  15  16 
 5: e  14  15  16  17 
 6: f  15  16  17  18

I have a workaround: it works if I change the name of the second column of DT1:

setnames(DT1, "y", "new_y")

# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

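Why is this happening, and is there a way to merge any number of data.tables with the same column names without changing any of the column names?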

6 Answers:

Answer 0: (score: 8)

If it's just those 5 data.tables (where x is the same in all of them), you could also use nested joins:

# set the key for each datatable to 'x'
setkey(DT1,x)
setkey(DT2,x)
setkey(DT3,x)
setkey(DT4,x)
setkey(DT5,x)

# the nested join
mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]

Or, as @Frank said in the comments:

DTlist <- list(DT1,DT2,DT3,DT4,DT5)
Reduce(function(X,Y) X[Y], DTlist)

which gives:

   x y1 y2 y3 y4 y5
1: a 10 11 12 13 14
2: b 11 12 13 14 15
3: c 12 13 14 15 16
4: d 13 14 15 16 17
5: e 14 15 16 17 18
6: f 15 16 17 18 19

This gives the same result as:

mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

> identical(mergedDT1,mergedDT2)
[1] TRUE

When your x columns do not have the same values, a nested join will not give the desired result:

DT1[DT2[DT3[DT4[DT5[DT6]]]]]

this gives:

   x y1 y2 y3 y4 y5 y6
1: b 11 12 13 14 15 15
2: c 12 13 14 15 16 16
3: d 13 14 15 16 17 17
4: e 14 15 16 17 18 18
5: f 15 16 17 18 19 19
6: g NA NA NA NA NA 20

While:

Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))

gives:

   x y1 y2 y3 y4 y5 y6
1: a 10 11 12 13 14 NA
2: b 11 12 13 14 15 15
3: c 12 13 14 15 16 16
4: d 13 14 15 16 17 17
5: e 14 15 16 17 18 18
6: f 15 16 17 18 19 19
7: g NA NA NA NA NA 20

Data used:

To make the code with Reduce work, I changed the names of the y columns.
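The data definitions for this answer are missing from this copy. A reconstruction inferred from the printed outputs (value columns y1 to y6, with DT6 covering b to g instead of a to f) might look like the sketch below; the exact original definitions are an assumption.

```r
library(data.table)

# Assumed reconstruction: same x in DT1..DT5, uniquely named value columns
DT1 <- data.table(x = letters[1:6], y1 = 10:15)
DT2 <- data.table(x = letters[1:6], y2 = 11:16)
DT3 <- data.table(x = letters[1:6], y3 = 12:17)
DT4 <- data.table(x = letters[1:6], y4 = 13:18)
DT5 <- data.table(x = letters[1:6], y5 = 14:19)

# DT6 is shifted by one letter (b..g), so its x values differ from the others
DT6 <- data.table(x = letters[2:7], y6 = 15:20, key = "x")
```

With these definitions the value columns never clash, which is why the plain Reduce/merge call no longer needs any suffixes.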

Answer 1: (score: 6)

If you want to rename during the merge, you can keep a counter inside Reduce like this:

Reduce((function() {counter = 0
                    function(x, y) {
                      counter <<- counter + 1
                      d = merge(x, y, all = T, by = 'x')
                      setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                    }})(), list(DT1, DT2, DT3, DT4, DT5))
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19
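The counter is needed because a plain merge only ever appends the default .x/.y suffixes to clashing names, so repeated merges end up reusing them. The sketch below traces the column names step by step on toy tables that mirror the question's DT1 to DT4; the names dts, m2, m3 and m4 are illustrative and not from the original post.

```r
library(data.table)

# Four tables with the same column names, as in the question
dts <- lapply(0:3, function(k) data.table(x = letters[1:6], y = (10:15) + k))

# First merge: both inputs have "y", so it is split into y.x and y.y
m2 <- merge(dts[[1]], dts[[2]], all = TRUE, by = "x")
print(names(m2))

# Second merge: the left side no longer has a plain "y", so nothing clashes
m3 <- merge(m2, dts[[3]], all = TRUE, by = "x")
print(names(m3))

# Third merge: "y" clashes again and gets the same .x/.y suffixes a second time
# (recent data.table versions may additionally warn at this step)
m4 <- merge(m3, dts[[4]], all = TRUE, by = "x")
print(names(m4))

# The duplicated names in m4 are what a fifth merge refuses to work with
print(anyDuplicated(names(m4)) > 0)
```

This matches the four-table output in the question, where the header reads x y.x y.y y.x y.y.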

Answer 2: (score: 4)

Stack and reshape

I don't think this maps exactly onto the merge function, but...

mycols <- "x"
DTlist <- list(DT1,DT2,DT3,DT4,DT5)

dcast(rbindlist(DTlist,idcol=TRUE), paste0(paste0(mycols,collapse="+"),"~.id"))

#    x  1  2  3  4  5
# 1: a 10 11 12 13 14
# 2: b 11 12 13 14 15
# 3: c 12 13 14 15 16
# 4: d 13 14 15 16 17
# 5: e 14 15 16 17 18
# 6: f 15 16 17 18 19

I have no idea whether this extends to having more columns than y.

Merge-assign

DT <- Reduce(function(...) merge(..., all = TRUE, by = mycols), 
  lapply(DTlist,`[.noquote`,mycols))

for (k in seq_along(DTlist)){
  js = setdiff( names(DTlist[[k]]), mycols )
  DT[DTlist[[k]], paste0(js,".",k) := mget(paste0("i.",js)), on=mycols, by=.EACHI]
}

#    x y.1 y.2 y.3 y.4 y.5
# 1: a  10  11  12  13  14
# 2: b  11  12  13  14  15
# 3: c  12  13  14  15  16
# 4: d  13  14  15  16  17
# 5: e  14  15  16  17  18
# 6: f  15  16  17  18  19

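(I'm not sure whether this fully extends to other cases. It's hard to say, because the OP's example really doesn't demand the full functionality of merge. In the OP's case, with mycols="x" and x being the same across all DT*, a merge is obviously inappropriate, as @eddi mentioned. The general problem is interesting, though, so that's what I'm trying to attack here.)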

Answer 3: (score: 2)

Using reshape gives you more flexibility in how the columns are named.

library(dplyr)
library(tidyr)

list(DT1, DT2, DT3, DT4, DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste("y", source, sep = ".")) %>%
  spread(source, y)

Alternatively, this will work:

library(dplyr)
library(tidyr)

list(DT1 = DT1, DT2 = DT2, DT3 = DT3, DT4 = DT4, DT5 = DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste(source, "y", sep = ".")) %>%
  spread(source, y)
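In current tidyr releases spread() is superseded by pivot_wider(). A minimal sketch of the same stack-and-reshape idea with pivot_wider() follows; it uses small plain data.frames instead of the question's data.tables, and the names dfs and wide are illustrative assumptions.

```r
library(dplyr)
library(tidyr)

# Toy inputs with identical column names (stand-ins for DT1, DT2, ...)
dfs <- list(
  data.frame(x = letters[1:3], y = 1:3),
  data.frame(x = letters[1:3], y = 4:6)
)

wide <- dfs %>%
  bind_rows(.id = "source") %>%                        # stack, recording each table's position
  mutate(source = paste("y", source, sep = ".")) %>%   # build the new column names
  pivot_wider(names_from = source, values_from = y)    # one column per source table

print(names(wide))
```

Keys that are missing from one of the inputs simply end up as NA in that table's column, which is comparable to what merge(..., all = TRUE) does.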

Answer 4: (score: 2)

Another way of doing it:

dts <- list(DT1, DT2, DT3, DT4, DT5)

names(dts) <- paste("y", seq_along(dts), sep="")
data.table::dcast(rbindlist(dts, idcol="id"), x ~ id, value.var = "y")

#   x y1 y2 y3 y4 y5
#1: a 10 11 12 13 14
#2: b 11 12 13 14 15
#3: c 12 13 14 15 16
#4: d 13 14 15 16 17
#5: e 14 15 16 17 18
#6: f 15 16 17 18 19

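The package name in "data.table::dcast" is added to ensure that the call returns a data.table and not a data.frame, even if the "reshape2" package is loaded as well. Without the explicit package name, the dcast function from the reshape2 package might be used instead; that one works on a data.frame and returns a data.frame rather than a data.table.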
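A quick way to confirm which dcast was used is to check the class of the result. The sketch below does that on a two-table toy list; the object names dts_small and wide are illustrative and not part of the answer.

```r
library(data.table)

# Named list: rbindlist() uses the names as values of the id column
dts_small <- list(
  y1 = data.table(x = letters[1:3], y = 1:3),
  y2 = data.table(x = letters[1:3], y = 4:6)
)

# Explicit namespace, so a loaded reshape2 cannot mask the call
wide <- data.table::dcast(rbindlist(dts_small, idcol = "id"), x ~ id, value.var = "y")

print(is.data.table(wide))
print(names(wide))
```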

Answer 5: (score: 1)

Alternatively, you could setNames the columns beforehand and then merge, like this:

dts = list(DT1, DT2, DT3, DT4, DT5)
names(dts) = paste('DT', c(1:5), sep = '')    

dtlist = lapply(names(dts),function(i) 
         setNames(dts[[i]], c('x', paste('y',i,sep = '.'))))

Reduce(function(...) merge(..., all = T), dtlist)

#   x y.DT1 y.DT2 y.DT3 y.DT4 y.DT5
#1: a    10    11    12    13    14
#2: b    11    12    13    14    15
#3: c    12    13    14    15    16
#4: d    13    14    15    16    17
#5: e    14    15    16    17    18
#6: f    15    16    17    18    19