Discovering the dependency relations among the samples of a data set

Time: 2018-12-04 18:28:46

Tags: r process

In R, let's say I have a data set similar to the one below:

ID  Activity
1   a
1   b
2   a
3   c
2   a
1   c
4   a
4   b
3   b
4   c

As you can see, each ID has a sequence of activities. What matters is the number of times each activity is followed by another one. The results I am looking for are:

  1. The variants present in the data set (the sequence of activities for each ID), like:

    <a,b,c> : id: 1 & 4
    <a,a>   : id: 2
    <c,b>   : id: 3

  2. A matrix showing the number of times each activity (row) is followed by another one (column), like:

      a b c
    a 1 2 0
    b 0 0 1
    c 0 1 0

Thank you for your help.

1 answer:

Answer 0: (score: 0)

Here is a solution using data.table:

    library(data.table)
    dt <- data.table(ID=c(1,1,2,3,2,1,4,4,3,4),Activity=c("a","b","a","c","a","c","a","b","b","c"))

  1. The IDs for each sequence:

    dt[,.(seq=paste(Activity,collapse = ",")),ID][,.(ids=paste(ID,collapse = ",")),seq]
    
  2. We can quickly get the count for each pair of consecutive activities:

    consecutive_id <- dt[,.(first=(Activity),second=(shift(Activity,type = "lead"))),ID][!is.na(second)]
    consecutive <- consecutive_id[,.N,.(first,second)]
    

    But if you need the result in matrix form, a few extra steps are required:

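    # Build an n x n table of zeros, one row and one column per activity
    classes <- dt[,unique(Activity)]
    n <- length(classes)
    M_consecutive <- data.table(matrix(0,nrow = n,ncol=n))
    setnames(M_consecutive,classes)
    # Add the row labels as a key column so rows can be looked up by activity
    M_consecutive$classes <- classes
    setkey(M_consecutive,classes)
    # Fill in each observed (first, second) count
    for(i in seq_len(nrow(consecutive))) M_consecutive[consecutive[i]$first,(consecutive[i]$second):=consecutive[i]$N]
    M_consecutive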
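
For comparison, the same two results can probably be obtained in base R without the fill loop. The following is only a sketch, not part of the original answer: the names `df`, `seqs`, `variants`, `pairs`, `acts` and `M` are made up for illustration, and it assumes the row order within each ID is the order in which the activities occurred.

```r
# Same sample data as in the question, as a plain data.frame (assumed layout)
df <- data.frame(
  ID = c(1, 1, 2, 3, 2, 1, 4, 4, 3, 4),
  Activity = c("a", "b", "a", "c", "a", "c", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Split the activities by ID; row order within each ID is preserved
seqs <- split(df$Activity, df$ID)

# Variants: collapse each sequence to a string, then group the IDs by it
variant_of <- vapply(seqs, paste, character(1), collapse = ",")
variants <- split(names(variant_of), variant_of)

# Consecutive (first, second) pairs, computed separately for each ID
pairs <- do.call(rbind, lapply(seqs, function(s) {
  if (length(s) < 2) return(NULL)  # a single activity has no successor
  data.frame(first = head(s, -1), second = tail(s, -1),
             stringsAsFactors = FALSE)
}))

# Cross-tabulate with fixed factor levels so unseen transitions appear as 0
acts <- sort(unique(df$Activity))
M <- table(
  first  = factor(pairs$first,  levels = acts),
  second = factor(pairs$second, levels = acts)
)

variants
M
```

Using factors with explicit levels is what keeps the table square: without them, `table()` would drop any activity that never appears as a first or as a second element. Rows of `M` are the preceding activity and columns are the one that follows.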