My data is in this format (a longer, but still abbreviated, dataset can be found here):
pull_req_id,user,action,created_at
1679,NiGhTTraX,opened,1380104504
1678,akaariai,opened,1380044613
1678,akaariai,opened,1380044618
...
The following libraries are loaded:
library(TraMineR)
library(sqldf)
I load the data with this function (which is fast):
read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  # Keep only the columns needed for sequence analysis and rename them.
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  # Keep events whose local date falls between startdate and stopdate.
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  # Replace each timestamp with its position among the distinct timestamps.
  data$time <- match(data$time, unique(data$time))
  data$end <- match(data$end, unique(data$end))
  (data)
}
project_sequences <- read_seqdata("/Users/name/github/local/data/event-data.txt",
  '2012-01-01', '2012-06-30')
I then run this function to compute the sequence lengths (which is very slow):
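sequence_length <- function(data){
  # The time axis has to cover the largest time index in the data.
  slmax <- max(data$time)
  # Convert the spell-format data (one row per event) to DSS format.
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time",
    end="end", id="id", status="event", limit=slmax)
  # Build the sequence object, deleting missing values on the left, on the right, and in gaps.
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL",
    gaps = "DEL")
  sequences.length <- seqlength(sequences.sts)
  (sequences.length)
}
project_length <- sequence_length(project_sequences)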
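However, this is extremely slow. Any suggestions on how to refactor the code to speed it up?
Some timestamps are thousands of steps apart, but each sequence has only a few steps. Could the distance between timestamps in different sequences be what causes the long computation time (more than 20 hours on a university supercomputer)?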
Answer 0 (score: 1)
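The read_seqdata function above creates timestamps that look shorter than the original seconds-since-epoch format, but the resulting timestamps still differ by as much as 50,000 units. Apparently, this slows TraMineR down significantly. My solution was to create a new function that reads the data without timestamps: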
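read_seqdata_notime <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  # Keep only the columns needed for sequence analysis and rename them.
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  # Keep events whose local date falls between startdate and stopdate.
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  # Group the events by id, in the order in which the rows appear in the data.
  data.split <- split(data$event, data$id)
  # Pad every sequence with NA up to the longest one so they fit in one data frame.
  list.to.df <- function(arg.list) {
    max.len <- max(sapply(arg.list, length))
    arg.list <- lapply(arg.list, `length<-`, max.len)
    as.data.frame(arg.list)
  }
  data <- list.to.df(data.split)
  # Transpose so that each row holds one sequence.
  data <- t(data)
  (data)
}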
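As a quick illustration of the split-and-pad step inside read_seqdata_notime, the sketch below applies the same idea to a small made-up data frame. The toy ids and events are assumptions for illustration and are not taken from the dataset. It also swaps as.data.frame followed by t for do.call(rbind, ...), which produces the same one-row-per-sequence layout in a single step; as far as I can tell, as.data.frame may alter names that are not syntactically valid (such as purely numeric ids), which rbind does not do.

```r
# Toy events, already in chronological order (made-up values for illustration).
events <- data.frame(
  id    = c("a", "a", "b", "b", "b", "c"),
  event = c("opened", "closed", "opened", "discussed", "merged", "opened"),
  stringsAsFactors = FALSE
)

# One character vector of events per id.
by_id <- split(events$event, events$id)

# Pad every vector with NA up to the length of the longest sequence.
max_len <- max(lengths(by_id))
padded  <- lapply(by_id, `length<-`, max_len)

# Bind the padded vectors as rows: one row per id, one column per position.
wide <- do.call(rbind, padded)

print(wide)
```

The resulting matrix has one row per id and NA in the trailing positions of the shorter sequences, which is the shape that seqdef is given in the timestamp-free version.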
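Reading the data this way greatly speeds up the subsequent TraMineR commands, but it limits the sequence analysis to measures that depend strictly on activity type or ordering and ignores duration (that is, length, entropy, number of subsequences, and dissimilarity can all still be used).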
For example, the function that stores the sequence lengths in a variable becomes:
sequence_length <- function(data){
  # Build the sequence object directly from the one-row-per-sequence table.
  sequences.sts <- seqdef(data, left = "DEL", gaps = "DEL", right = "DEL")
  sequences.length <- seqlength(sequences.sts)
  (sequences.length)
}
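If, as suggested above, the cost comes from the span of the time index rather than from the number of events, the difference between the two ways of reading the data can be seen on a toy example. The sketch below uses only base R and made-up values (the ids, events, and times are assumptions). It compares the global index built by match(time, unique(time)) with a per-sequence position, and counts events per id without any time axis.

```r
# Toy event log (made-up values): three pull requests, events far apart in time.
events <- data.frame(
  id    = c(1, 1, 2, 2, 2, 3),
  event = c("opened", "closed", "opened", "discussed", "merged", "opened"),
  time  = c(100, 9000, 250, 260, 50000, 70000)
)
events <- events[order(events$time), ]

# Global index, as in read_seqdata: one position per distinct timestamp.
events$global <- match(events$time, unique(events$time))

# Per-sequence index: position of each event within its own id.
events$local <- ave(events$time, events$id, FUN = seq_along)

global_width <- max(events$global)  # grows with the number of distinct timestamps
local_width  <- max(events$local)   # bounded by the longest single sequence

# Number of events per id, computed without any time axis.
events_per_id <- table(events$id)

print(c(global = global_width, local = local_width))
print(events_per_id)
```

On real data the global width would be on the order of the number of distinct timestamps in the selected period, while the local width stays at the length of the longest sequence. When every event counts as one step, the per-id counts from table should match what seqlength reports for the timestamp-free version, so they can serve as a cheap cross-check.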