I have a sequence object created like this:
subsequences <- function(data){
slmax <- max(data$time)
sequences.seqe <- seqecreate(data)
sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
(sequences.sts)
}
data <- subsequences(data)
head(data)
which gives the output:
Sequence
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged
[3] *-discussed-*-discussed-*-discussed-*-discussed
[4] *-opened-*-discussed-merged-discussed
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed
But when I count the subsequences, I get seemingly absurd answers:
seqsubsn(head(data))
[!] found missing state in the sequence(s), adding missing state to the alphabet
Subseq.
[1] 1036
[2] 1248
[3] 88
[4] 49
[5] 294
[6] 240
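How can the number of subsequences be so much larger than the number of events in each sequence?
A dput() of the dataset can be found here. The problem seemed to be that the raw data have timestamps in seconds. However, I have already converted the timestamps to a simple sequential order with the function below: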
read_seqdata <- function(data, startdate, stopdate){
data <- read.table(data, sep = ",", header = TRUE)
data <- subset(data, select = c("pull_req_id", "action", "created_at"))
colnames(data) <- c("id", "event", "time")
data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
'unixepoch', 'localtime') <= '",stopdate,"'"))
data$end <- data$time
data <- data[with(data, order(time)), ]
data$time <- match( data$time , unique( data$time ) )
data$end <- match( data$end , unique( data$end ) )
slmax <- max(data$time)
(data)
}
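This makes it possible to compute proper measures such as entropy, sequence length, and so on, but the number of subsequences is still a problem.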
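To make the conversion concrete, here is a toy illustration of the `match(x, unique(x))` step used in read_seqdata. The timestamps are made up for the example and are not taken from the dataset; the sketch assumes the values are already sorted, as they are after the `order(time)` step.

```r
# Made-up Unix timestamps in seconds (hypothetical values, already sorted)
timestamps <- c(1357000000, 1357000060, 1357000060, 1357003600)

# Each timestamp is replaced by the position of its first occurrence
# among the unique values, so tied timestamps share the same position
positions <- match(timestamps, unique(timestamps))
print(positions)  # 1 2 2 3
```

The result keeps only the order of the events: equal timestamps stay tied, and the size of the gaps between timestamps is discarded.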
Answer 0 (score: 2)
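The number of subsequences returned is not surprising at all. It is a matter of how "subsequence" is defined, which should not be confused with "substring".
A sequence $x = (x_1, x_2, \ldots, x_n)$ is a subsequence of $y$ if all its elements $x_i$ appear in $y$ in the same order as in $x$, though not necessarily next to each other. For example, A-B-A is a subsequence of C-A-D-B-C-D-A-D.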
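The definition can be checked mechanically. The following is a minimal base R sketch; the function name `is_subsequence` is introduced here for illustration and is not part of TraMineR.

```r
# Returns TRUE if the states in x occur in y in the same order,
# not necessarily contiguously (x and y are character vectors)
is_subsequence <- function(x, y) {
  pos <- 1  # index of the next element of x still to be matched
  for (element in y) {
    if (pos <= length(x) && identical(element, x[pos])) {
      pos <- pos + 1
    }
  }
  pos > length(x)  # TRUE once every element of x has been matched
}

y <- c("C", "A", "D", "B", "C", "D", "A", "D")
is_subsequence(c("A", "B", "A"), y)  # TRUE
is_subsequence(c("A", "A", "B"), y)  # FALSE: no B occurs after the second A
```

Note that A-B-A is not a substring of the example sequence, because its three states never occur contiguously there.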
To illustrate, consider the "mvad" example from the TraMineR package.
library(TraMineR)
data(mvad)
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad, 17:86, states = mvad.scodes)
print(mvad.seq[1:3,], format="SPS")
## Sequence
##[1] (EM,4)-(TR,2)-(EM,64)
##[2] (FE,36)-(HE,34)
##[3] (TR,24)-(FE,34)-(EM,10)-(JL,2)
seqsubsn(mvad.seq)[1:3]
##[1] 7 4 16
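By default, seqsubsn counts the subsequences of the sequence of distinct successive states (DSS). For example, the DSS of the first sequence is EM-TR-EM. The seven subsequences of EM-TR-EM are: the empty subsequence, EM, TR, EM-TR, EM-EM, TR-EM, and EM-TR-EM.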
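As a cross-check, the DSS and its distinct subsequences can be enumerated by brute force in base R. This is only a sketch for short sequences: it does not call TraMineR, the helper `enumerate_subsequences` is a name made up for this example, and it assumes that state labels do not themselves contain a hyphen.

```r
# First mvad sequence in state form, rebuilt from its SPS display
# (EM,4)-(TR,2)-(EM,64)
sts <- rep(c("EM", "TR", "EM"), times = c(4, 2, 64))

# Distinct successive states: collapse each run of identical states
dss <- rle(sts)$values
print(dss)  # "EM" "TR" "EM"

# Enumerate all distinct subsequences, including the empty one.
# The cost grows exponentially, so use this on short sequences only.
enumerate_subsequences <- function(states) {
  n <- length(states)
  subs <- "(empty)"
  for (k in seq_len(n)) {
    index_sets <- combn(n, k, simplify = FALSE)
    subs <- c(subs, vapply(index_sets,
                           function(i) paste(states[i], collapse = "-"),
                           character(1)))
  }
  unique(subs)
}

dss_subsequences <- enumerate_subsequences(dss)
print(dss_subsequences)
length(dss_subsequences)  # 7
```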
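In the same way, you can verify that your fourth sequence (which is equal to its DSS)
*-opened-*-discussed-merged-discussed
has 49 subsequences. The missing state * counts as a state here, as the seqsubsn message indicates. Of these 49 subsequences, 10 have length two:
*-opened, *-*, *-discussed, *-merged,
opened-*, opened-discussed, opened-merged,
discussed-merged, discussed-discussed,
merged-discussed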
Hope this helps.