Question

I would like to add a column to a data frame (integrates2) that counts duplicates in sequence. Here is what the data look like:

name    program  date of contact   helper column
John     ffp        10/11/2014          2
John     TP         10/27/2014          2
Carlos   TP         11/19/2015          3
Carlos   ffp        12/1/2015           3
Carlos   wfd        12/31/2015          3
Jen      ffp        9/9/2014            2
Jen      TP         9/30/2014           2

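This is a list of people who attended certain programs on certain dates. I added a helper column that counts the duplicates, and I sorted the rows by date of contact. I want to count the program combinations that exist (e.g. ffp-tp, tp-ffp-wfd).

To do this, I would like to use the following code to transpose the ordered combinations, with the help of a new column named "program2":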

 #transpose the programs
 library(reshape2)
 dcast(integrates2, name ~ program2, value.var = "program")

Then I plan to use the following code to convert the result into a table and a data frame and count the frequencies:

 res = table(integrates2)
 resdf = as.data.frame(res)

I saw this approach in the following question: Count number of time combination of events appear in dataframe columns ext

What I need "program2" to contain is:

  Name    program  date of contact   helper column   program2
  John     ffp        10/11/2014          2             1
  John     TP         10/27/2014          2             2
  Carlos   TP         11/19/2015          3             1
  Carlos   ffp        12/1/2015           3             2
  Carlos   wfd        12/31/2015          3             3

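That way, I can use "program2" to transpose the programs into separate columns and then count the combinations. The final result should look like this: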

    program  pro1   pro2   freq      
     ffp     tp             2   
     TP      ffp    wfd     1

I'm sure there are easier ways to do this, but as far as I can tell, this is where I am. Thanks for any help!

Answer 1

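After thinking about this question, I believe the following is a workable approach. If you do not mind combining all the program names into one string, you can do the following, which is probably much simpler.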

library(data.table)
setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
           list(total = .N), by = type]

#         type total
#1:     ffp-TP     2
#2: TP-ffp-wfd     1

If you want to separate the program names, you can do so using cSplit() from the splitstackshape package.

library(splitstackshape)
temp <- setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
              list(total = .N), by = type]

cSplit(temp, splitCols = "type", sep = "-")

#   total type_1 type_2 type_3
#1:     2    ffp     TP     NA
#2:     1     TP    ffp    wfd

The dplyr equivalent is:

library(dplyr)
group_by(mydf, name) %>%
  summarise(type = paste(program, collapse = "-")) %>%
  count(type)

#        type     n
#       (chr) (int)
#1     ffp-TP     2
#2 TP-ffp-wfd     1

Data

mydf <- structure(list(name = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"), program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"), dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015", "12/1/2015", "12/31/2015", "9/9/2014", "9/30/2014"), helperColumn = c(2L, 2L, 3L, 3L, 3L, 2L, 2L)), .Names = c("name", "program", "dateOfContact", "helperColumn"), class = "data.frame", row.names = c(NA, -7L))
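
For the sequence column that the question asks about (program2), a minimal base R sketch is shown here. It assumes the rows should be ordered by contact date within each person and that the dates are month/day/year strings; the data frame `df` is an illustrative copy of the question's data, not part of the original answer.

```r
# Illustrative copy of the question's data (assumption: dates are m/d/Y strings)
df <- data.frame(
  name = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"),
  program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"),
  dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015", "12/1/2015",
                    "12/31/2015", "9/9/2014", "9/30/2014"),
  stringsAsFactors = FALSE
)

# Parse the dates so that sorting is chronological rather than alphabetical
df$dateOfContact <- as.Date(df$dateOfContact, format = "%m/%d/%Y")

# Sort by person, then by date of contact
df <- df[order(df$name, df$dateOfContact), ]

# Number the rows within each person: 1, 2, 3, ...
df$program2 <- ave(seq_along(df$name), df$name, FUN = seq_along)

print(df)
```

With program2 in place, the dcast() call from the question can spread the programs into one column per position.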

Answer 2

Edit: returns permutations

Using dplyr,

library(dplyr)
integrates2 %>% group_by(name) %>% summarise(prg1 = program[1],
                                             prg2 = program[2],
                                             prg3 = program[3]) %>% 
  select(prg1, prg2, prg3) %>% group_by(prg1, prg2, prg3) %>% summarise(freq = n())

which returns

Source: local data frame [2 x 4]
Groups: prg1, prg2 [?]

    prg1   prg2   prg3  freq
  (fctr) (fctr) (fctr) (int)
1    ffp     TP     NA     2
2     TP    ffp    wfd     1

With mydf2 from the comments, it produces

Source: local data frame [3 x 4]
Groups: prg1, prg2 [?]

   prg1  prg2  prg3  freq
  (chr) (chr) (chr) (int)
1   ffp    TP    NA     1
2    TP   ffp    NA     1
3   wfd    TP   ffp     1

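The chain

calls group_by on name to separate the cases;
summarise turns program into three columns;
select narrows the data down to those columns;
group_by groups by all the prg* columns; and
summarise collapses them to the unique groups and adds freq, the number of occurrences of each group.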
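
The same chain can be sketched a little more compactly, assuming a reasonably recent dplyr in which count() accepts a `name` argument. The data frame `df` is a hypothetical copy of the question's data, and the three hard-coded columns carry the same limitation as the chain above (at most three programs per person).

```r
library(dplyr)

# Illustrative copy of the question's data, already ordered by date within name
df <- data.frame(
  name = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"),
  program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(name) %>%
  # first, second and third program per person (NA when there is none)
  summarise(prg1 = program[1], prg2 = program[2], prg3 = program[3]) %>%
  # count() groups by the listed columns and tallies the rows in one step
  count(prg1, prg2, prg3, name = "freq")

print(result)
```

Here count() stands in for the select, group_by and summarise(freq = n()) steps.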

Or, if you prefer, you can do the whole thing in base R, though it is considerably less readable (at least with this particular approach):

tab <- table(sapply(split(integrates2$program, integrates2$name), 
             function(x){paste(x, collapse = '-')}))
prgs <- strsplit(names(tab), '-')
programs <- do.call(rbind, lapply(prgs, function(x){
  c(x, rep(NA, max(sapply(prgs, length)-length(x))))
  }))
programs <- cbind(as.data.frame(programs), matrix(tab))
names(programs) <- c(paste0('prgm', seq(length(programs)-1)), 'freq')

A very quick and dirty version, which collapses each series into a string:

table(sapply(split(integrates2$program, integrates2$name), 
             function(x){paste(x, collapse = '-')}))

which returns

ffp-TP TP-ffp-wfd 
     2          1

or, wrapped in as.matrix,

           [,1]
ffp-TP        2
TP-ffp-wfd    1

Pre-edit version: returns combinations

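Using reshape2, you can use dcast to make a data frame of the program combinations (using [,-1] to cut off name, which we don't care about):

library(reshape2)
programs <- dcast(integrates2, name ~ program, value.var = 'program')[,-1]

programs looks like:

> programs
  ffp TP  wfd
1 ffp TP  wfd
2 ffp TP <NA>
3 ffp TP <NA>

Now you can group by all the column names with dplyr's group_by (done programmatically here, but if you would like to see what is going on, you can do it manually with group_by(ffp, TP, wfd)), and use summarise with n() to get a count of the number of rows in each group:

library(dplyr)
programs %>% group_by_(.dots = names(programs)) %>% summarise(freq = n())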
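
group_by_() has since been deprecated in dplyr. A sketch of the same count with current verbs follows, assuming dplyr 1.0 or later (for across() and the .groups argument); here `programs` is typed in by hand from the printed output above rather than produced by dcast.

```r
library(dplyr)

# Hand-typed copy of the printed `programs` data frame (illustrative assumption)
programs <- data.frame(
  ffp = c("ffp", "ffp", "ffp"),
  TP  = c("TP", "TP", "TP"),
  wfd = c("wfd", NA, NA),
  stringsAsFactors = FALSE
)

freq <- programs %>%
  # group by every column without naming them individually
  group_by(across(everything())) %>%
  # one row per distinct combination, with its number of occurrences
  summarise(freq = n(), .groups = "drop")

print(freq)
```

Because this version ignores order, ffp-TP and TP-ffp fall into the same group, which is why the edited answer switched to permutations.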
