Question

In R, let's say I have a data set similar to the one below:

ID  Activity
1   a
1   b
2   a
3   c
2   a
1   c
4   a
4   b
3   b
4   c

As you can see, each ID has a sequence of activities. What matters is the number of times an activity is followed by another one. The results I am looking for are:

1. The variants that exist in the dataset (the sequence of activities for each ID), like:

   <a,b,c> : id: 1 & 4
   <a,a>   : id: 2
   <c,b>   : id: 3

2. A matrix like the following, which shows the number of times an activity is followed by another one:

      a b c
    a 1 2 0
    b 0 0 1
    c 0 1 0

Thank you for your help.

Answer 1

Here is a data.table solution:

library(data.table)
dt <- data.table(ID=c(1,1,2,3,2,1,4,4,3,4),Activity=c("a","b","a","c","a","c","a","b","b","c"))

For the variants and the IDs that follow each one:

dt[,.(seq=paste(Activity,collapse = ",")),ID][,.(ids=paste(ID,collapse = ",")),seq]
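The same grouping can also be sketched in base R, which may help to see the two steps separately: first collapse each ID's activities into one string, then group the IDs by that string. This is only an illustration; the names `df`, `seq_by_id` and `variants` are made up here and are not part of the data.table code.

```r
# Same example data as in the question, as a plain data.frame.
df <- data.frame(
  ID = c(1, 1, 2, 3, 2, 1, 4, 4, 3, 4),
  Activity = c("a", "b", "a", "c", "a", "c", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Step 1: collapse each ID's activities, in row order, into one string.
seq_by_id <- tapply(df$Activity, df$ID, paste, collapse = ",")

# Step 2: group the IDs (as character names) that share the same string.
variants <- split(names(seq_by_id), as.vector(seq_by_id))
variants
```

Both versions rely on the rows already being in the intended order within each ID; if there is a timestamp column, sort by it first.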

For the consecutive pairs, we can get the counts quickly:

consecutive_id <- dt[,.(first=(Activity),second=(shift(Activity,type = "lead"))),ID][!is.na(second)]
consecutive <- consecutive_id[,.N,.(first,second)]

But if you need the result in matrix form, some extra steps are required:

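# Build an empty n x n table with one column per activity
classes <- dt[,unique(Activity)]; n <- length(classes)
M_consecutive <- data.table(matrix(0,nrow = n,ncol=n))
setnames(M_consecutive,classes)
# Add a key column holding the "first" activity so rows can be looked up by name
M_consecutive$classes <- classes; setkey(M_consecutive,classes)
# Fill in the count for each (first, second) pair
for(i in seq_len(nrow(consecutive))) M_consecutive[consecutive[i]$first,(consecutive[i]$second):=consecutive[i]$N]
M_consecutive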
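If a plain matrix-like result is enough, a shorter route is probably base R's `table()`. The sketch below is an alternative to the loop above, not the same method: it pairs each activity with its successor inside the same ID and cross-tabulates the pairs. The names `df`, `nxt`, `levs` and `trans` are made up for this example.

```r
# Same example data as in the question, as a plain data.frame.
df <- data.frame(
  ID = c(1, 1, 2, 3, 2, 1, 4, 4, 3, 4),
  Activity = c("a", "b", "a", "c", "a", "c", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# For every row, the next activity within the same ID;
# the last activity of each ID has no successor and gets NA.
nxt <- ave(df$Activity, df$ID, FUN = function(x) c(x[-1], NA))

# Shared factor levels keep the table square, even for activities
# that never occur as a "first" or never occur as a "second".
levs <- sort(unique(df$Activity))

# table() drops the NA pairs by default, so only real transitions are counted.
trans <- table(
  first  = factor(df$Activity, levels = levs),
  second = factor(nxt, levels = levs)
)
trans
```

Rows are the first activity and columns the one that follows it. On the sample data this counts b followed by c twice (IDs 1 and 4), so that cell comes out as 2 rather than the 1 shown in the question's example matrix.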
