In R, let's say I have a data set similar to the one below:
ID Activity
1 a
1 b
2 a
3 c
2 a
1 c
4 a
4 b
3 b
4 c
As you can see, each ID has a sequence of activities. What matters is the number of times an activity is followed by another one. The results I am looking for are:
1. Discovering the existing variants in the data set (the activity sequence for each ID), like:
<a,b,c> : id: 1 & 4
<a,a> : id: 2
<c,b> : id: 3
2. A matrix like the following, which shows the number of times an activity is followed by another one (rows are the preceding activity, columns the following one):
  a b c
a 1 2 0
b 0 0 2
c 0 1 0
Thank you for your help.
Answer 0 (score: 0)
Here is a solution using data.table:
library(data.table)
dt <- data.table(ID=c(1,1,2,3,2,1,4,4,3,4),Activity=c("a","b","a","c","a","c","a","b","b","c"))
The variants (the activity sequence of each ID), together with the IDs that share each one:
dt[,.(seq=paste(Activity,collapse = ",")),ID][,.(ids=paste(ID,collapse = ",")),seq]
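For readers without data.table, the same two grouping steps can be sketched in base R. This is not part of the original answer; the names `df`, `seq_by_id` and `variants` are illustrative assumptions, and the sketch relies on the rows already being in chronological order within each ID.

```r
# Same toy data as in the question
df <- data.frame(
  ID = c(1, 1, 2, 3, 2, 1, 4, 4, 3, 4),
  Activity = c("a", "b", "a", "c", "a", "c", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Step 1: collapse each ID's activities, in row order, into one string
seq_by_id <- tapply(df$Activity, df$ID, paste, collapse = ",")

# Step 2: group the IDs by the sequence they produced
variants <- split(names(seq_by_id), as.vector(seq_by_id))
print(variants)
```

Each element of `variants` is named after a sequence and holds the IDs (as character strings) that follow it.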
The number of times each activity is directly followed by another one can be obtained quickly:
consecutive_id <- dt[,.(first=(Activity),second=(shift(Activity,type = "lead"))),ID][!is.na(second)]
consecutive <- consecutive_id[,.N,.(first,second)]
But if you need the result in matrix form, a few extra steps are required:
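classes <- dt[, unique(Activity)]
n <- length(classes)
# n x n table of zeros with one column per activity
M_consecutive <- data.table(matrix(0, nrow = n, ncol = n))
setnames(M_consecutive, classes)
# add a row-label column and key on it so rows can be looked up by activity
M_consecutive$classes <- classes
setkey(M_consecutive, classes)
# copy each pair count into the cell at row = first, column = second
for (i in seq_len(nrow(consecutive))) M_consecutive[consecutive[i]$first, (consecutive[i]$second) := consecutive[i]$N]
M_consecutive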
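As an alternative to filling the matrix in a loop, the cross-tabulation can be done with `table()` in base R. This is a sketch under the same assumption that rows are in chronological order within each ID; it is not the answer's method, and the names `df`, `lv`, `pairs` and `M` are illustrative.

```r
# Same toy data as in the question
df <- data.frame(
  ID = c(1, 1, 2, 3, 2, 1, 4, 4, 3, 4),
  Activity = c("a", "b", "a", "c", "a", "c", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Fix the level set so rows and columns cover every activity,
# including ones that never start or never end a pair
lv <- sort(unique(df$Activity))

# For each ID, pair every activity with the one that directly follows it
pairs <- do.call(rbind, lapply(split(df$Activity, df$ID), function(x) {
  data.frame(first = head(x, -1), second = tail(x, -1),
             stringsAsFactors = FALSE)
}))

# Cross-tabulate: rows = preceding activity, columns = following activity
M <- table(
  first = factor(pairs$first, levels = lv),
  second = factor(pairs$second, levels = lv)
)
print(M)
```

Using factors with explicit levels keeps the result square even when some activity never appears as a first or second element. On the toy data, b is followed by c twice (IDs 1 and 4), and `unclass(M)` gives a plain integer matrix if a `table` object is not wanted.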