I am trying to do sequential pattern mining in R with the arulesSequences package.
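After removing all kinds of duplicates, my dataset has 626,047 rows. Unfortunately, I cannot post the dataset here, so I created sample data in Google Sheets to show what the data looks like. It is here. The data is named df_sq.
It has 3 columns: numeric_id, service_name and time1.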
I have been able to convert the data to the 'transactions' format that the package requires. But when I run cspade, I get the following error:
Error in makebin(data, file) : 'eid' invalid (strict order)
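From reading other questions on Stack Overflow, I understand this means I have to sort the data. So I went back and sorted my original data by numeric_id and time, and also the other way round. Then I converted the data to the 'transactions' format again and re-ran cspade.
I still get the same error.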
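For reference, here is a minimal sketch of how the ordering could be checked before calling cspade. It assumes that "strict order" means eventID must be strictly increasing within each sequenceID, which is my reading of the error message and not something I have confirmed in the package documentation. The helper name check_strict_order and the toy data are made up for illustration; only base R is used.

```r
# Hypothetical helper: TRUE if rows are sorted by sid and, within each sid,
# eid is strictly increasing (no ties, no decreases).
check_strict_order <- function(sid, eid) {
  n <- length(sid)
  if (n < 2) return(TRUE)
  same_seq   <- sid[-1] == sid[-n]          # consecutive rows in the same sequence
  sid_sorted <- all(sid[-1] >= sid[-n])     # sequence IDs never go backwards
  eid_strict <- all(eid[-1][same_seq] > eid[-n][same_seq])
  sid_sorted && eid_strict
}

# Toy data (made up): sequence 1 has two rows with the same time1
toy <- data.frame(
  numeric_id   = c(1, 1, 1, 2, 2),
  service_name = c("a", "b", "c", "a", "b"),
  time1        = c(10, 10, 20, 5, 7)
)
toy <- toy[order(toy$numeric_id, toy$time1), ]

check_strict_order(toy$numeric_id, toy$time1)  # FALSE: tie at time1 = 10

# Rows that share a (numeric_id, time1) pair with another row
key  <- toy[, c("numeric_id", "time1")]
ties <- toy[duplicated(key) | duplicated(key, fromLast = TRUE), ]
ties
```

If a check like this still returns FALSE on the real data after sorting, then several rows share the same numeric_id and time1, and sorting alone cannot make the event IDs strictly increasing.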
Has anyone used this package before?
This is the code I used:
library(arules)
library(arulesViz)
library(arulesSequences)
library(sqldf)
df_sq = read.csv("service_data.csv", stringsAsFactors = FALSE)
#Changing class of timestamp column and coercing product name to factor
df_sq$time1 = as.integer(as.numeric(df_sq$time1))
df_sq$service_name = as.factor(df_sq$service_name)
#Clearing duplicates
df_sq = sqldf("select distinct numeric_id, service_name, time1
from df_sq")
#Ordering the dataset on numeric id and time
df_sq = df_sq[order(df_sq$numeric_id, df_sq$time1),]
#Other orderings I also tried, one at a time
#df_sq = df_sq[order(df_sq$time1),]
#df_sq = df_sq[order(df_sq$numeric_id),]
#Converting to transactional format per the package
sq_data = data.frame(item=df_sq$service_name)
sq_tran = as(sq_data, "transactions")
transactionInfo(sq_tran)$sequenceID = df_sq$numeric_id
transactionInfo(sq_tran)$eventID = df_sq$time1
summary(sq_tran)
#Running cSpade
s1 = cspade(sq_tran, parameter = list(support = 0.1),
            control = list(verbose = TRUE), tmpdir = tempdir())
summary(s1)