Question

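I am trying to do sequential pattern mining with arulesSequences in R.

After removing all kinds of duplicates, my dataset has 626,047 rows. Unfortunately, I cannot post the dataset here. I created sample data in a Google Sheet to show what the data looks like. It is here. The data is named df_sq.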

It has 3 columns:

numeric_id - class numeric. This is the user_id
product - class factor
time - class integer

I have been able to convert the data into the 'transactions' format required by the package. But when running cSpade, I get the following error:

Error in makebin(data, file) : 'eid' invalid (strict order)

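Now, from reading other questions on Stack Overflow, I understand this means I have to order the data. So I went back and ordered my original data by numeric_id and time, and vice versa. I then re-converted the data into the 'transactions' format and re-ran cSpade.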

I still get the same error.
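
A minimal sketch of how the ordering could be checked before conversion, assuming the error means that eventID must strictly increase within each sequenceID. It uses base R only. The toy data frame (invented values) and the helper name is_strictly_ordered are hypothetical and not part of arulesSequences:

```r
# Hypothetical toy data with the same three columns as df_sq (values invented)
toy <- data.frame(
  numeric_id   = c(1, 1, 1, 2, 2),
  service_name = factor(c("a", "b", "c", "a", "c")),
  time1        = c(10L, 10L, 20L, 5L, 7L)
)

# Sort by user first, then by time within each user
toy <- toy[order(toy$numeric_id, toy$time1), ]

# Hypothetical helper: TRUE when sid is non-decreasing and
# eid strictly increases within each run of equal sid
is_strictly_ordered <- function(sid, eid) {
  same_seq <- diff(sid) == 0   # consecutive rows in the same sequence
  all(diff(sid) >= 0) && all(diff(eid)[same_seq] > 0)
}

is_strictly_ordered(toy$numeric_id, toy$time1)

# Rows sharing a (numeric_id, time1) pair are ties that sorting cannot remove
key  <- toy[, c("numeric_id", "time1")]
ties <- toy[duplicated(key) | duplicated(key, fromLast = TRUE), ]
ties
```

If ties is non-empty on the real data, several rows share the same user and timestamp, so sorting alone would not give a strict order under the assumption stated above.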

Has anyone used this package before?

Here is the code I used:

library(arules)
library(arulesViz)
library(arulesSequences)
library(sqldf)

df_sq = read.csv("service_data.csv", stringsAsFactors = FALSE)

#Changing class of timestamp column and coercing product name to factor

df_sq$time1 = as.integer(as.numeric(df_sq$time1))
df_sq$service_name = as.factor(df_sq$service_name)

#Clearing duplicates

df_sq = sqldf("select distinct numeric_id, service_name, time1 
               from df_sq")

#Ordering the dataset on numeric id and time

df_sq = df_sq[order(df_sq$numeric_id, df_sq$time1),]
#Alternative orderings also tried:
#df_sq = df_sq[order(df_sq$time1),]
#df_sq = df_sq[order(df_sq$sequenceID),]

#Converting to transactional format per the package

sq_data = data.frame(item=df_sq$service_name)
sq_tran = as(sq_data, "transactions")
transactionInfo(sq_tran)$sequenceID = df_sq$numeric_id
transactionInfo(sq_tran)$eventID = df_sq$time1

summary(sq_tran)

#Running cSpade

s1 = cspade(sq_tran, parameter = list(support = 0.1),
            control = list(verbose = TRUE), tmpdir = tempdir())

summary(s1)


0 answers: