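I have been trying to run cSPADE on a data set with about 7 million records in my transaction file (7 million unique sequenceID x eventID pairs). The support values I get on this data set look completely wrong. However, when I use ~86,000 records (more or less the head of that file), the results look correct. I noticed that up to that size the verbose log reports only 1 partition, whereas with ~850,000 records 3 partitions are used.
Verbose output with about 100,000 records (results are reasonable):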
> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))
parameter specification:
support : 0.1
maxsize : 10
maxlen : 1
algorithmic control:
bfstype : FALSE
verbose : TRUE
summary : FALSE
tidLists : FALSE
preprocessing ... 1 partition(s), 1.98 MB [0.7s]
mining transactions ... 0 MB [0.21s]
reading sequences ... [0.03s]
total elapsed time: 0.94s
> summary(s1)
set of 14 sequences with
most frequent items:
A B C D E (Other)
2 2 1 1 1 8
.
.
.
summary of quality measures:
support
Min. :0.1306
1st Qu.:0.3701
Median :0.7021
Mean :0.5773
3rd Qu.:0.7184
Max. :0.9903
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
trans 83686 10059 0.1
Verbose output with about 1,000,000 records (results look wrong):
> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))
parameter specification:
support : 0.1
maxsize : 10
maxlen : 1
algorithmic control:
bfstype : FALSE
verbose : TRUE
summary : FALSE
tidLists : FALSE
preprocessing ... 3 partition(s), 19.55 MB [4.6s]
mining transactions ... 0 MB [0.6s]
reading sequences ... [0.01s]
total elapsed time: 5.19s
> summary(s1)
set of 0 sequences with
most frequent items:
integer(0)
most frequent elements:
integer(0)
element (sequence) size distribution:
< table of extent 0 >
sequence length distribution:
< table of extent 0 >
summary of quality measures:
< table of extent 0 >
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
trans 826830 96238 0.1
I found that setting the number of partitions to 1 when calling cSPADE fixes the problem. However, cSPADE then emits a warning:
s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE,numpart=1))
Warning message:
In cspade(trans, parameter = list(support = 0.1, maxlen = 1), control = list(verbose = TRUE, :
  'numpart' less than recommended
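Do I need to heed this warning? What are the drawbacks, if any, of setting numpart = 1 (forcing the number of partitions to 1)? Is there a way to get the correct answer without controlling this parameter?
Answer 0 (score: 4)
For the benefit of anyone else who runs into the same problem: I ended up emailing the author of the package. He said this was not a known issue and suggested that I stick with numpart = 1.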
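If you want some independent reassurance that the numpart = 1 output is right, the maxlen = 1 case from the question can be cross-checked by hand, because the support of a single item is just the share of sequences that contain it at least once. The sketch below does that in base R. The data frame `events`, its column names, and the helper `item_support` are hypothetical and only illustrate the idea; your own transaction file will need to be loaded into a similar long format first.

```r
# Hypothetical toy data: one row per (sequenceID, eventID, item).
# Replace this with your own transactions in the same long layout.
events <- data.frame(
  sequenceID = c(1, 1, 1, 2, 2, 3, 3, 4),
  eventID    = c(1, 2, 3, 1, 2, 1, 2, 1),
  item       = c("A", "B", "A", "A", "C", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: support of each single item, i.e. the fraction of
# distinct sequences in which the item occurs at least once.
item_support <- function(events) {
  n_sequences <- length(unique(events$sequenceID))
  # Count each item at most once per sequence.
  pairs <- unique(events[, c("sequenceID", "item")])
  counts <- vapply(split(pairs$sequenceID, pairs$item), length, integer(1))
  sort(counts / n_sequences, decreasing = TRUE)
}

baseline <- item_support(events)

# Keep only the items that clear the same threshold passed to cspade.
frequent <- baseline[baseline >= 0.1]
print(frequent)
```

If the single-item supports computed this way match what cspade reports with numpart = 1 but not what it reports with the default partitioning, that is at least consistent with the partitioned run being the faulty one.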