Question

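I am pulling data from a time-tracking API that records time spent on projects. Each entry has variables for the duration, the client, the project and, possibly, several tags describing the project. However, when I extract the data, entries with multiple tags are duplicated into otherwise identical rows, each carrying only one unique tag, like this: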

 duration client project    tag
       60      A       X  first
       45      B       Y second
       45      B       Y  third
       30      C       Z fourth

How can I remove the duplicate rows while combining the tags? I am thinking of something like this:

A)
  duration client project    tags
1       60      A       X   first
2       45      B       Y  second, third
3       30      C       Z  fourth

Or this:

B)
  duration client project    tag1   tag2
1       60      A       X   first     NA
2       45      B       Y  second  third
3       30      C       Z  fourth     NA

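I would also appreciate advice on which arrangement (A or B) is better suited to quickly summarising the time spent on projects with, for example, the tags "first" and "third" (105 minutes in this example).

Here is a sample data frame: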

df <- data.frame(
  duration = c(60, 45, 45, 30),
  client = c("A", "B", "B", "C"),
  project = c("X", "Y", "Y", "Z"),
  tag = c("first", "second", "third", "fourth")
  )

I appreciate any suggestions (I feel this should not be too hard with dplyr / tidyr, but I still cannot get it quite right). Thanks!

Answer 1

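We can use dplyr for output A. group_by_at(vars(-tag)) is a way to specify that the grouping variables should be every column except tag, because you want rows that are exact duplicates in all the other columns to be collapsed.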

library(dplyr)

df2 <- df %>%
  group_by_at(vars(-tag)) %>%
  summarise(tags = toString(tag)) %>%
  ungroup()
df2
# # A tibble: 3 x 4
#   duration client project          tags
#      <dbl> <fctr>  <fctr>         <chr>
# 1       30      C       Z        fourth
# 2       45      B       Y second, third
# 3       60      A       X         first
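In newer dplyr releases (1.0.0 and later, which is an assumption about the installed version) the scoped verb group_by_at() is superseded by across(). The following is a minimal sketch of the same "group by everything except tag" idea under that assumption; the name df2_across is only illustrative.

```r
library(dplyr)

# Sample data from the question (strings kept as character)
df <- data.frame(
  duration = c(60, 45, 45, 30),
  client = c("A", "B", "B", "C"),
  project = c("X", "Y", "Y", "Z"),
  tag = c("first", "second", "third", "fourth"),
  stringsAsFactors = FALSE
)

# Group by every column except tag, then collapse the tags of each group
# into one comma-separated string; .groups = "drop" removes the grouping
df2_across <- df %>%
  group_by(across(-tag)) %>%
  summarise(tags = toString(tag), .groups = "drop")
```

The .groups = "drop" argument takes the place of the separate ungroup() step.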

Then we can use splitstackshape for output B:

library(splitstackshape)
df3 <- df2 %>% cSplit(splitCols = "tags")
df3
#    duration client project tags_1 tags_2
# 1:       30      C       Z fourth     NA
# 2:       45      B       Y second  third
# 3:       60      A       X  first     NA
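Since the question mentions tidyr, output B can probably also be reached with tidyr::separate(). This is only a sketch: it assumes no entry has more than two tags and that the tags are joined with ", " as toString() does. The layout-A data frame is rebuilt by hand here so the example runs on its own, and df3_tidyr is an illustrative name.

```r
library(tidyr)

# Layout A, rebuilt by hand so this example is self-contained
df2 <- data.frame(
  duration = c(30, 45, 60),
  client = c("C", "B", "A"),
  project = c("Z", "Y", "X"),
  tags = c("fourth", "second, third", "first"),
  stringsAsFactors = FALSE
)

# Split the combined tags into two columns; fill = "right" pads entries
# that have only one tag with NA instead of raising a warning.
# Assumption: at most two tags per entry.
df3_tidyr <- separate(df2, tags, into = c("tag1", "tag2"),
                      sep = ", ", fill = "right")
```

With more than two tags per entry, the into vector would need one name per possible tag.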

Answer 2

Your solution A looks good to me. I would do it like this:

library(data.table)

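# Convert df to a data.table by reference
setDT(df)
# Collapse the tags within each group of otherwise identical rows
# (grouping on all non-tag columns, as the question asks)
df[, tags := paste0(tag, collapse = ", "), by = .(duration, client, project)]
# Drop the single-tag column, then remove the now-duplicated rows
df[, tag := NULL]
df <- unique(df)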

This gives you the result you want in layout A:

   duration client project          tags
1:       60      A       X         first
2:       45      B       Y second, third
3:       30      C       Z        fourth
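On the follow-up about summing the time for tags such as "first" and "third": the sketch below compares the original long format with layout A, using base R only. It is an illustration under assumptions, not a recommendation from the answers here; the names wanted, total_long, df_a and total_a are hypothetical.

```r
# Sample data from the question (strings kept as character)
df <- data.frame(
  duration = c(60, 45, 45, 30),
  client = c("A", "B", "B", "C"),
  project = c("X", "Y", "Y", "Z"),
  tag = c("first", "second", "third", "fourth"),
  stringsAsFactors = FALSE
)
wanted <- c("first", "third")

# Long format (one row per tag): filter on the tag column directly
total_long <- sum(df$duration[df$tag %in% wanted])

# Layout A: combine the tags, then split them again to test membership
df_a <- aggregate(tag ~ duration + client + project, data = df, FUN = toString)
tag_list <- strsplit(as.character(df_a$tag), ", ", fixed = TRUE)
has_wanted <- vapply(tag_list, function(x) any(x %in% wanted), logical(1))
total_a <- sum(df_a$duration[has_wanted])

total_long  # 105
total_a     # 105
```

One caveat: in the long format an entry is counted once per matching tag, so asking for two tags that belong to the same entry (for example "second" and "third") would count its 45 minutes twice, whereas layout A counts each entry once.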

Answer 3

I would use plyr for A):

library(plyr)
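# For each client, paste all of its tags into one string and keep one row.
# Note: the combined string stays in the column named tag.
df2 <- ddply(df, .(client), function(d) {
  d$tag <- paste(d$tag, collapse = ",")
  d[1, ]
})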

Combining a single unique variable across otherwise identical rows