如何将数据帧转换为R中序列挖掘的可用格式?

时间:2018-04-02 13:55:50

标签: r sequence arules

我想在R中进行序列分析,我正在尝试将我的数据转换为arulesSequences包的可用形式。

library(tidyverse)
library(arules)
library(arulesSequences)

df <- data_frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))
df.trans <- as(df, "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
seq <- cspade(df.trans, parameter = list(support = 0.4), control = list(verbose = TRUE))

如果将我的列保留为上面的原始类,我会收到错误:

Error in asMethod(object) : 
  column(s) 1, 2, 3, 4 not logical or a factor. Discretize the columns first.

但是,如果我将列转换为因子,我会收到另一个错误:

df <- data_frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))

df <- as.data.frame(lapply(df, as.factor))
df.trans <- as(df, "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
seq <- cspade(df.trans, parameter = list(support = 0.4), control = list(verbose = TRUE))

Error in asMethod(object) :
In makebin(data, file) : 'eventID' is a factor

对于解决此问题的任何建议或对R中序列挖掘的建议一般都非常感谢。谢谢!

1 个答案:

答案 0 :(得分:2)

只有实际项目(在您的情况下为“网站”)才能进入交易。始终检查您的中间结果,以确保它看起来正确。 ? cspade中描述了序列挖掘所需的事务类型。

library("arulesSequences")
df <- data.frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))

# convert site into itemsets and add sequence and event ids
df.trans <- as(df[,"site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
inspect(df.trans)

# sort by sequenceID
df.trans <- df.trans[order(transactionInfo(df.trans)$sequenceID),]
inspect(df.trans)

# mine sequences
seq <- cspade(df.trans, parameter = list(support = 0.2), 
              control = list(verbose = TRUE))
inspect(seq)

希望这有帮助!