Duplicating columns in a table based on their names

Time: 2018-04-26 04:38:29

Tags: r data.table

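I was handed an example of Excel epicness. A wide table gives the capacity for each pair of product (in rows) and machine (in columns). The table looks similar to the one in the following reproducible example (note that I use data.table; data.frame / tidyverse solutions are welcome, but data.table solutions are preferred):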

library(data.table)

a <- data.table(names = c("product 1", "product 2"), "9-10" = c(1, 5), "21-23" = c(3, 2))

> a
       names 9-10 21-23
1: product 1    1     3
2: product 2    5     2

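The problem is that "9-10" means that machines 9 and 10 have the same capacity (1 and 5 for products 1 and 2, respectively). I am looking for a way to end up with a table that looks like b: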

> b
       names 9 10 21 23
1: product 1 1  1  3  3
2: product 2 5  5  2  2

I achieved it with the following code:

# for each machine number, add a column copied from the range column that contains it
for (i in unlist(strsplit(names(a)[2:3], split = "-", fixed = TRUE))){
    a[, (i) := .SD, .SDcols = grep(paste0(i, "\\b"), names(a)[2:3], value = TRUE)]
}

a[, names(a)[2:3] := NULL]

I would like to know what a cleaner way of doing this would be.

3 Answers:

Answer 0: (score: 4)

Using data.table, we can build an index of column positions, subset the columns with it, and then adjust the names.

# data
a <- data.table(names = c("product 1", "product 2"),
                "9-10" = c(1, 5),
                "21-23" = c(3, 2))


# names split
name_pos <- strsplit(names(a), split = "-")
# create index for subsetting based on name_pos
index <- rep(seq_along(name_pos), times = lengths(name_pos))

# index and adjust names
a_final <- a[, ..index]
# thanks to Frank for suggestion
setnames(a_final, unlist(name_pos))
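
To see why the index works: for the example table, name_pos has lengths 1, 2, 2, so index becomes 1 2 2 3 3, i.e. each range column is selected once per machine it covers. The same steps can be wrapped in a small helper, sketched below. The name expand_range_cols is made up for illustration and is not part of data.table; the sketch also assumes the separator only appears in column names that should be split.

```r
library(data.table)

# Hypothetical helper (not part of data.table): repeats every column once per
# piece of its name after splitting the name on `sep`.
expand_range_cols <- function(dt, sep = "-") {
  name_pos <- strsplit(names(dt), split = sep, fixed = TRUE)
  # one entry per output column, pointing at the source column position
  index <- rep(seq_along(name_pos), times = lengths(name_pos))
  out <- dt[, index, with = FALSE]   # same effect as dt[, ..index]
  setnames(out, unlist(name_pos))
  out[]
}

a <- data.table(names = c("product 1", "product 2"),
                "9-10" = c(1, 5),
                "21-23" = c(3, 2))

b <- expand_range_cols(a)
```

Because the column subset is a new table, a itself keeps its original "9-10" and "21-23" columns.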

Answer 1: (score: 4)

Another possibility with data.table:

melt(a, id = 1)[, unlist(tstrsplit(variable,'-')), by = .(names, value)
                ][, dcast(.SD, names ~ V1)]

which gives:

       names 10 21 23 9
1: product 1  1  3  3 1
2: product 2  5  2  2 5
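
Note that dcast orders the new columns by sorting the machine names as character strings, which is why 9 ends up last. If numeric order matters, one option is to reorder afterwards with setcolorder. The following is only a sketch: it splits the chain into two steps for readability, introduces the column name machine for illustration, and assumes every column other than names has a purely numeric name.

```r
library(data.table)

a <- data.table(names = c("product 1", "product 2"),
                "9-10" = c(1, 5),
                "21-23" = c(3, 2))

# long format: one row per product / machine
long <- melt(a, id.vars = "names")[
  , .(machine = unlist(tstrsplit(variable, "-"))), by = .(names, value)]

# back to wide format; columns come out sorted as character strings
wide <- dcast(long, names ~ machine, value.var = "value")

# assumption: every column except `names` has a purely numeric name
machine_cols <- setdiff(names(wide), "names")
setcolorder(wide, c("names", machine_cols[order(as.integer(machine_cols))]))
```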

Answer 2: (score: 1)

A solution using tidyr:

library(tidyr)
library(dplyr)
a %>% gather(variable, value, -names) %>% 
  separate(variable, c("col1","col2")) %>% mutate(value2 = value) %>%
  spread(col1, value)  %>% spread(col2, value2) %>%
  group_by(names) %>%
  summarise_all(sum,na.rm = TRUE) %>%
  as.data.frame()
#       names 21 9 10 23
# 1 product 1  3 1  1  3
# 2 product 2  2 5  5  2
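
The pipeline above relies on separate() with two target columns, so it assumes every column name has exactly two parts. With a more recent tidyr, a possibly simpler variant is to reshape to long format, split the names into rows, and reshape back. This is a sketch rather than the answerer's method; the column name machine is introduced for illustration, and a is built as a plain data.frame here so the block is self-contained.

```r
library(tidyr)
library(dplyr)

a <- data.frame(names = c("product 1", "product 2"),
                "9-10" = c(1, 5),
                "21-23" = c(3, 2),
                check.names = FALSE)   # keep "9-10" and "21-23" as column names

b <- a %>%
  # one row per product / range column
  pivot_longer(-names, names_to = "machine", values_to = "value") %>%
  # "9-10" becomes two rows, "9" and "10", each keeping the same value
  separate_rows(machine, sep = "-") %>%
  # one column per machine again
  pivot_wider(names_from = machine, values_from = value)
```

Since separate_rows() creates one row per piece, this should also cope with names that have more than two parts, such as a hypothetical "1-2-3".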