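Earlier question
In this post, I asked how to extract the so-called tidLists, which record whether each discovered frequent sequence is present in each of the transactions that were used to mine those sequences. More specifically: how can I extract a boolean matrix (indicating presence or absence of each sequence) whose row order matches that of the original transaction data set?
In the end this turned out to be easy, using the transactionInfo attribute of the tidLists stored in the object of class sequences.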
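New question
This question is slightly different from the earlier one: how can I 'score' new transactions given a set of frequent sequences? That is, given an object of class sequences, how do I obtain an object of class tidLists from a new object of class transactions?
To illustrate, I put together an example using a toy data set: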
library(arules)
library(arulesSequences)
library(stringr)
#Function used to convert character string into an object of type transactions.
#Source: https://github.com/cran/clickstream/blob/master/R/Clickstream.r.
as.transactions <- function(clickstreamList) {
  transactionID <- unlist(lapply(seq_along(clickstreamList), FUN =
    function(x) rep(names(clickstreamList)[x], length(clickstreamList[[x]]))), use.names = FALSE)
  sequenceID <- unlist(lapply(seq_along(clickstreamList), FUN =
    function(x) rep(x, length(clickstreamList[[x]]))))
  eventID <- unlist(lapply(clickstreamList, FUN = function(x)
    seq_along(x)), use.names = FALSE)
  transactionInfo <- data.frame(transactionID, sequenceID, eventID)
  tr <- as(as.data.frame(unlist(clickstreamList, use.names = FALSE)), "transactions")
  transactionInfo(tr) <- transactionInfo
  itemInfo(tr)$labels <- itemInfo(tr)$levels
  return(tr)
}
#Dataset to mine frequent sequences from
data_mine_freq_seq <- data.frame(id = 1:10,
                                 transaction = c("A B B A",
                                                 "A B C B D C B B B F A",
                                                 "A A B",
                                                 "B A B A",
                                                 "A B B B B",
                                                 "A A A B",
                                                 "A B B A B B",
                                                 "E F F A C B D A B C D E",
                                                 "A B B A B",
                                                 "A B"))
#Convert data to list containing character vectors
data_for_fseq_mining <- str_split(string = data_mine_freq_seq$transaction, pattern = " ")
#Include identifiers as names
names(data_for_fseq_mining) <- data_mine_freq_seq$id
#Convert to object of type transactions
data_for_fseq_mining_trans <- as.transactions(clickstreamList = data_for_fseq_mining)
#Mine frequent sequences with cspade, given some parameters.
sequences <- cspade(data = data_for_fseq_mining_trans,
                    parameter = list(support = 0.10, maxlen = 4, maxgap = 2),
                    control = list(tidList = TRUE, verbose = TRUE))
#Create a data frame that contains all sequences and their support (167 sequences in total).
sequences_df <- cbind(sequence = labels(sequences),
support = sequences@quality)
Now I create a new data set that contains just one transaction:
data_score <- data.frame(id = 11, transaction = "A B B C D A")
#Convert data to list containing character vectors
data_score_list <- str_split(string = data_score$transaction, pattern = " ")
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans <- as.transactions(clickstreamList = data_score_list)
How can I find out which of the frequent sequences in the object 'sequences' are contained in 'data_score_trans'?
Edit
I tried the following code:
supportingTransactions(x = sequences, transactions = data_score_trans)
which produces the expected and desired result:
tidLists in sparse format with
167 items/itemsets (rows) and
1 transactions (columns)
However, an error occurs when the new transaction contains an element that does not appear in the original data set:
#Added a 'G' at the end of the transaction. Element 'G' is not an element in
#'data_mine_freq_seq'.
data_score <- data.frame(id = 11, transaction = "A B B C D A G")
#Convert data to list containing character vectors
data_score_list <- str_split(string = data_score$transaction, pattern = " ")
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans <- as.transactions(clickstreamList = data_score_list)
#Score 'data_score_trans' using 'sequences' again:
supportingTransactions(x = sequences, transactions = data_score_trans)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
How can this be solved?
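One direction worth trying, on the assumption that the error is caused by the item label 'G' being unknown to 'sequences', is to drop unseen items from the new transaction before converting it. The base-R sketch below is untested against supportingTransactions; `known_items` and `event_table` are hypothetical names introduced here. It records each event's original position before filtering, so that gaps between the remaining events are not shortened when an unseen item sits in the middle of a transaction.

```r
# Items that occur in the mining data set (toy data above); an assumption
# of this sketch is that these are exactly the labels known to 'sequences'.
known_items <- c("A", "B", "C", "D", "E", "F")

# New transaction that ends with the unseen item "G".
new_events <- strsplit("A B B C D A G", " ")[[1]]

# Store the original position of every event BEFORE filtering.
event_table <- data.frame(item = new_events,
                          eventID = seq_along(new_events),
                          stringsAsFactors = FALSE)

# Keep only events whose item was seen during mining.
event_table <- event_table[event_table$item %in% known_items, ]

# Encode the items with the same levels as the mining data.
event_table$item <- factor(event_table$item, levels = known_items)

print(event_table)
```

Note that the as.transactions helper above always numbers events 1, 2, 3, ..., so it would have to be adapted to take the eventID column from this table rather than regenerate it.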
Answer 0 (score: 0)
I came up with a workaround that relies on regular expressions. I defined the following function:
score_pattern <- function(pattern, events) {
  # Extract the elements of the sequence, e.g. "<{B},{A}>" gives "{B}" and "{A}".
  regex_elements <- str_extract_all(string = pattern, pattern = "\\{.*?\\}")
  # Strip the curly braces.
  regex_elements <- str_replace_all(string = unlist(regex_elements),
                                    pattern = "\\{|\\}", replacement = "")
  # Build a regular expression that matches the elements in order,
  # allowing any number of events in between.
  expr <- ""
  for (i in seq_along(regex_elements)) {
    if (i == 1) {
      expr <- paste0(expr, "(^| )", regex_elements[i])
    } else {
      expr <- paste0(expr, "( | .*? )", regex_elements[i])
    }
  }
  expr <- paste0(expr, "( |$)")
  print(expr)
  # 1 if the pattern occurs in the transaction, 0 otherwise.
  score <- ifelse(grepl(pattern = expr, x = events), 1, 0)
  return(score)
}
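To illustrate its use, here is an example in which I take one sequence from the 'sequence' column of 'sequences_df' and the transaction data from the 'transaction' column of 'data_score':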
score_pattern(pattern = "<{B},{A}>", events = data_score$transaction)
[1] "(^| )B( | .*? )A( |$)"
[1] 1
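The function returns a numeric vector of zeros and ones indicating whether the sequence is present in each supplied transaction (1 = yes, 0 = no).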
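To score several sequences against several transactions in one go, the same regular-expression idea can be wrapped so that it returns a matrix. The following is a minimal base-R sketch rather than part of the original answer: `pattern_to_regex`, the example patterns, and the second transaction are made up for illustration, and `perl = TRUE` is used only so that the non-greedy `.*?` is interpreted by PCRE.

```r
# Turn a sequence label such as "<{B},{A}>" into the regular expression
# used above, e.g. "(^| )B( | .*? )A( |$)". Hypothetical helper.
pattern_to_regex <- function(pattern) {
  elements <- regmatches(pattern, gregexpr("\\{[^}]*\\}", pattern))[[1]]
  items <- gsub("[{}]", "", elements)
  paste0("(^| )", paste(items, collapse = "( | .*? )"), "( |$)")
}

# Illustrative inputs (not taken from the mined result).
patterns <- c("<{B},{A}>", "<{A},{F}>", "<{C},{D}>")
events <- c("A B B C D A", "A B")

# One row per pattern, one column per transaction; entries are 0 or 1.
score_matrix <- sapply(events, function(e) {
  vapply(patterns,
         function(p) as.integer(grepl(pattern_to_regex(p), e, perl = TRUE)),
         integer(1))
})

print(score_matrix)
```

The rows and columns are named after the patterns and the transactions, so a single score can be looked up with `score_matrix["<{B},{A}>", "A B B C D A"]`.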
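Although this is a solution, it only works when no restriction is placed on the maximum gap between consecutive elements of a sequence: the generated regular expression has no counterpart to the 'maxgap' parameter. Conclusion: this only works if the 'maxgap' parameter of the cspade algorithm is not set.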
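If a gap limit is needed, one option is to drop the regular expression and check positions directly. The base-R sketch below is an illustration under stated assumptions, not a reimplementation of cspade: it assumes every element of the sequence holds a single item, that an event's position in the transaction plays the role of its eventID, and that 'maxgap' means the largest allowed difference between the positions of consecutive elements. `parse_pattern` and `contains_with_maxgap` are hypothetical helper names.

```r
# Split a label such as "<{B},{A}>" into its items: c("B", "A").
parse_pattern <- function(pattern) {
  elements <- regmatches(pattern, gregexpr("\\{[^}]*\\}", pattern))[[1]]
  gsub("[{}]", "", elements)
}

# TRUE if 'items' occur in 'events' in order, with at most 'maxgap'
# positions between consecutive matched events.
contains_with_maxgap <- function(items, events, maxgap = Inf) {
  # Positions at which the first item can be matched.
  reachable <- which(events == items[1])
  for (item in items[-1]) {
    if (length(reachable) == 0) return(FALSE)
    candidates <- which(events == item)
    # Keep positions lying 1..maxgap steps after some reachable position.
    keep <- vapply(candidates, function(p) {
      any(p - reachable >= 1 & p - reachable <= maxgap)
    }, logical(1))
    reachable <- candidates[keep]
  }
  length(reachable) > 0
}

events <- strsplit("A B B C D A", " ")[[1]]
items <- parse_pattern("<{B},{A}>")

unconstrained <- contains_with_maxgap(items, events)
constrained <- contains_with_maxgap(items, events, maxgap = 2)
print(c(unconstrained = unconstrained, constrained = constrained))
```

In this transaction the only 'A' that follows a 'B' is at position 6, three steps after the nearest 'B' at position 3, so under these assumptions the sequence counts as present without a gap limit but not with maxgap = 2.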