Strange number of subsequences?

Time: 2013-12-21 10:02:55

Tags: traminer

I have a sequence object created like this:

subsequences <- function(data){
  slmax <- max(data$time)
  sequences.seqe <- seqecreate(data)
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
  (sequences.sts)
}

data <- subsequences(data)

head(data)

which gives the output:

    Sequence                                                                     
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged             
[3] *-discussed-*-discussed-*-discussed-*-discussed                              
[4] *-opened-*-discussed-merged-discussed                                        
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed     
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed

But when I count the subsequences, I get seemingly absurd answers:

seqsubsn(head(data))
 [!] found missing state in the sequence(s), adding missing state to the alphabet
    Subseq.
[1]    1036
[2]    1248
[3]      88
[4]      49
[5]     294
[6]     240

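How can the number of subsequences be so much larger than the number of events in each sequence?

A dput() of the dataset can be found here. The problem seemed to be that the original data had timestamps in seconds. However, I have changed the timestamps to a simple ordering using the function below: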

read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  data$time <- match( data$time , unique( data$time ) )
  data$end <- match( data$end , unique( data$end ) )
  slmax <- max(data$time)
  (data)
}

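This makes it possible to compute proper measures of entropy, sequence length, and so on, but the number of subsequences is still a problem.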
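For illustration, the re-ranking step in read_seqdata (match() against unique() on values that are already sorted) behaves like a dense rank. The timestamps below are made-up epoch seconds, not values from the actual dataset.

```r
# Hypothetical epoch-second timestamps, already sorted in ascending order
time <- c(1387620175, 1387620175, 1387620301, 1387623999)

# Each timestamp is replaced by the index of its first occurrence among
# the unique values, so ties share a rank and ranks have no holes
match(time, unique(time))   # 1 1 2 3
```

Note that in read_seqdata the ranks are computed over the whole data frame, so they are shared across all ids rather than restarted for each id.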

1 answer:

Answer 0: (score: 2)

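The number of subsequences returned is not surprising at all. It comes down to the definition of a "subsequence", which should not be confused with a "substring".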

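A sequence $x = (x_1, x_2, \ldots, x_n)$ is a subsequence of $y$ if all its elements $x_i$ appear in $y$, in the same order as in $x$ but not necessarily next to each other. For example, A-B-A is a subsequence of C-A-D-B-C-D-A-D.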
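The definition can be checked mechanically. Below is a minimal base R sketch; is_subsequence is a hypothetical helper written for this illustration, not a TraMineR function.

```r
# TRUE if the elements of x appear in y in the same order,
# though not necessarily next to each other
is_subsequence <- function(x, y) {
  pos <- 0L                          # position in y of the last matched element
  for (el in x) {
    rest <- if (pos < length(y)) y[(pos + 1L):length(y)] else character(0)
    hit <- match(el, rest)           # first occurrence of el after pos
    if (is.na(hit)) return(FALSE)
    pos <- pos + hit
  }
  TRUE
}

y <- c("C", "A", "D", "B", "C", "D", "A", "D")
is_subsequence(c("A", "B", "A"), y)   # TRUE
is_subsequence(c("B", "A", "B"), y)   # FALSE: there is no B after the last A
```

A-B-A is not a substring of y, since its elements are not contiguous there, which is exactly the distinction made above.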

To illustrate, consider the "mvad" example from the TraMineR package.

library(TraMineR)
data(mvad)
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad, 17:86, states = mvad.scodes)
print(mvad.seq[1:3,], format="SPS")

##    Sequence                      
##[1] (EM,4)-(TR,2)-(EM,64)         
##[2] (FE,36)-(HE,34)               
##[3] (TR,24)-(FE,34)-(EM,10)-(JL,2)

seqsubsn(mvad.seq)[1:3]

##[1]  7  4 16

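By default, seqsubsn counts the subsequences of the sequence of distinct successive states (DSS). For example, the DSS of the first sequence is EM-TR-EM. The seven subsequences of EM-TR-EM are:

  • the empty sequence
  • two sequences consisting of a single element: EM and TR
  • three subsequences of length two: EM-TR, EM-EM, TR-EM
  • one subsequence of length three: EM-TR-EM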
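The count of seven can be reproduced by brute force in base R, without TraMineR. This is only a sketch: the states vector is rebuilt by hand from the SPS output shown above, and distinct_subsequences is a hypothetical helper that is practical only for short sequences, since it visits all 2^n subsets of positions.

```r
# First mvad sequence in SPS form: (EM,4)-(TR,2)-(EM,64)
states <- rep(c("EM", "TR", "EM"), times = c(4, 2, 64))

# DSS: collapse runs of identical successive states
dss <- rle(states)$values            # "EM" "TR" "EM"

# Enumerate every subset of positions (order preserved, empty set included)
# and keep the distinct results
distinct_subsequences <- function(s) {
  n <- length(s)
  subs <- character(0)
  for (mask in 0:(2^n - 1)) {
    keep <- (mask %/% 2^(0:(n - 1))) %% 2 == 1   # binary digits of mask
    subs <- c(subs, paste(s[keep], collapse = "-"))
  }
  unique(subs)
}

subs <- distinct_subsequences(dss)
length(subs)   # 7; the empty subsequence shows up as ""
```

Taking EM at position 1 or at position 3 alone yields the same subsequence, which is why there are 7 distinct results rather than 2^3 = 8.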

In the same way, you can verify that your fourth sequence (which is identical to its DSS)

*-opened-*-discussed-merged-discussed

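has 49 subsequences, counting the empty one. The missing state * is treated as a regular state here, as the seqsubsn message in the question indicates. Ten of the 49 are subsequences of length two:

  • *-opened
  • *-*
  • *-discussed
  • *-merged
  • opened-*
  • opened-discussed
  • opened-merged
  • discussed-merged
  • discussed-discussed
  • merged-discussed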
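Totals like these can also be checked without TraMineR, using the standard dynamic-programming count of distinct subsequences. The sketch below is base R only; count_distinct_subsequences is a hypothetical helper, not a TraMineR function, and it assumes the missing state * is an ordinary symbol.

```r
count_distinct_subsequences <- function(s) {
  total <- 1          # only the empty subsequence so far
  before <- list()    # per symbol: the total just before its previous occurrence
  for (el in s) {
    dup <- if (is.null(before[[el]])) 0 else before[[el]]
    new_total <- 2 * total - dup   # append el to everything, minus repeats
    before[[el]] <- total
    total <- new_total
  }
  total
}

# Third and fourth sequences from the question
seq3 <- c("*", "discussed", "*", "discussed", "*", "discussed", "*", "discussed")
seq4 <- c("*", "opened", "*", "discussed", "merged", "discussed")

count_distinct_subsequences(seq3)   # 88, as reported by seqsubsn in the question
count_distinct_subsequences(seq4)   # 49

# Distinct subsequences of length two in the fourth sequence
len2 <- unique(combn(seq4, 2, paste, collapse = "-"))
len2
```

Each new element can roughly double the running total, so the number of subsequences grows much faster than the number of events in a sequence.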

Hope this helps.