Question

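The problem of gathering multiple sets of columns has already been solved here: Gather multiple sets of columns, but in my case the column names are not unique.

I have the following data: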

input <- data.frame(
  id = 1:2, 
  question = c("a", "b"),
  points = 0,
  max_points = c(3, 5),
  question = c("c", "d"),
  points = c(0, 20),
  max_points = c(5, 20),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
input
#>   id question points max_points question points max_points
#> 1  1        a      0          3        c      0          5
#> 2  2        b      0          5        d     20         20

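The first column is the id, followed by many repeated columns (the original dataset has 133 columns):

An identifier for the question
The points awarded
The maximum points

I would like to end up with this structure: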

expected <- data.frame(
  id = c(1, 2, 1, 2),
  question = letters[1:4],
  points = c(0, 0, 0, 20),
  max_points = c(3, 5, 5, 20),
  stringsAsFactors = FALSE
)
expected
#>   id question points max_points
#> 1  1        a      0          3
#> 2  2        b      0          5
#> 3  1        c      0          5
#> 4  2        d     20         20

I tried several things:

tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")

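Neither gives the desired output. In addition, with more columns than shown here, gather no longer works because there are too many duplicated columns.

As a workaround, I tried this: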

# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))

# gather, remove number, spread
input %>% 
  gather(key, val, -id) %>%
  mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
  spread(key, val)

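This fails with the error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)

This problem has been discussed here: Unexpected behavior with tidyr, but I don't understand why or how I should add another identifier. Most likely this is not the main problem, since I should probably approach the whole thing differently.

How can I solve my problem, preferably with tidyr or base R? I don't know how to use data.table, but if there is a simple solution with it, I would settle for that too.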

Answer 1

Try this:

# take the id column plus each block of three columns, then stack the blocks
do.call(rbind,
        lapply(seq(2, ncol(input), 3), function(i) {
          input[, c(1, i:(i + 2))]
        }))

#   id question points max_points
# 1  1        a      0          3
# 2  2        b      0          5
# 3  1        c      0          5
# 4  2        d     20         20
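
The hard-coded offsets above rely on every block having exactly three columns in the same order. As a minimal sketch (the `block_size` variable and the header check are my own additions, not part of the answer), the same idea can be written so that it stops early if the layout is not what you expect:

```r
# Sample data from the question
input <- data.frame(
  id = 1:2,
  question = c("a", "b"),
  points = 0,
  max_points = c(3, 5),
  question = c("c", "d"),
  points = c(0, 20),
  max_points = c(5, 20),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

block_size <- 3                                  # assumption: columns per repeated block
starts <- seq(2, ncol(input), by = block_size)   # first column of each block

# Guard: every block must carry the same headers in the same order
headers <- lapply(starts, function(i) names(input)[i:(i + block_size - 1)])
stopifnot(all(vapply(headers, identical, logical(1), headers[[1]])))

# Stack id + block for each block
stacked <- do.call(rbind, lapply(starts, function(i) {
  input[, c(1, i:(i + block_size - 1))]
}))
rownames(stacked) <- NULL
stacked
```

If the real data has a block with a missing or reordered column, the `stopifnot()` call fails instead of silently stacking mismatched columns.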

Answer 2

The idiomatic way to do this in data.table is pretty simple:

library(data.table)
setDT(input)

res = melt(
  input, 
  id = "id", 
  meas = patterns("question", "^points$", "max_points"), 
  value.name = c("question", "points", "max_points")
)


   id variable question points max_points
1:  1        1        a      0          3
2:  2        1        b      0          5
3:  1        2        c      0          5
4:  2        2        d     20         20

You get an extra column named "variable", but you can drop it afterwards with res[, variable := NULL] if you like.

Answer 3

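Another way to achieve the same goal without using lapply:

We first grab all the columns for question, points, and max_points, then melt each set separately and cbind the results together.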

library(reshape2)

# id column plus every column with the given name
questions <- input[, c(1, which(names(input) == "question"))]
points <- input[, c(1, which(names(input) == "points"))]
max_points <- input[, c(1, which(names(input) == "max_points"))]

questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]

res <- cbind(questions_m,points_m, max_points_m)
res
  id questions points max_points
1  1         a      0          3
2  2         b      0          5
3  1         c      0          5
4  2         d     20         20

Answer 4

You may need to clarify how you want the ID column to be handled, but perhaps something like this?

runme <- function(word , dat){
     grep( paste0("^" , word , "$") , names(dat)) 
}

l <- mapply( runme ,  unique(names(input)) , list(input) )
l2 <- as.data.frame(l)

output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[,  as.numeric(l2[i,])  ])

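I'm not sure how robust this is when columns are repeated a different number of times, but it works for your test data and should work as long as every column is repeated the same number of times.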
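
To make that assumption explicit, here is a rough base R sketch (the names `pos`, `n_sets` and `out` are mine, purely for illustration) that groups column positions by header and refuses to run if the headers are not repeated equally often:

```r
# Sample data from the question
input <- data.frame(
  id = 1:2,
  question = c("a", "b"),
  points = 0,
  max_points = c(3, 5),
  question = c("c", "d"),
  points = c(0, 20),
  max_points = c(5, 20),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

nm <- names(input)[-1]                      # all headers except id
# Column positions grouped by header, keeping the original header order
pos <- split(seq_along(nm) + 1L, factor(nm, levels = unique(nm)))

# Every header must be repeated the same number of times
stopifnot(length(unique(lengths(pos))) == 1)
n_sets <- lengths(pos)[[1]]

# Concatenate the repeated columns of each header into one long column
cols <- lapply(pos, function(p) unlist(input[p], use.names = FALSE))

out <- cbind(
  data.frame(id = rep(input$id, times = n_sets)),
  as.data.frame(cols, stringsAsFactors = FALSE)
)
out
```

This pairs the n-th "question" column with the n-th "points" and "max_points" columns, so it only makes sense if the repeated sets appear in a consistent order in the source data.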

Gather duplicate column sets into single columns