I am trying to generate all the k-itemsets used in Apriori, following this pseudocode:
L1 = {frequent items};
for (k = 2; Lk-1 != ∅; k++) do begin
    Ck = candidates generated from Lk-1 (that is: Cartesian product Lk-1 x Lk-1, eliminating any
         k-1 size itemset that is not frequent);
    for each transaction t in database do
        increment the count of all candidates in
        Ck that are contained in t
    Lk = candidates in Ck with min_sup
end
return U_k Lk;
This is my code:
-- d transactions, threshold
kItemSets d thresh = kItemSets' 2 $ frequentItems d thresh
    where
        kItemSets' _ [] = [[]]
        kItemSets' k t = ck ++ (kItemSets' (k+1) ck)
            where
                -- those k-length sets that meet the threshold of being a subset of the transactions in d
                ck = filter (\x->(countSubsets x d) >= thresh) $ combinations k t
-- length n combinations that can be made from xs
combinations 0 _ = [[]]
combinations _ [] = []
combinations n xs@(y:ys)
    | n < 0 = []
    | otherwise = case drop (n-1) xs of
        [ ] -> []
        [_] -> [xs]
        _ -> [y:c | c <- combinations (n-1) ys]
                 ++ combinations n ys
-- those items with frequency of at least o in the dataset
frequentItems xs o = [y| y <- nub cs, x<-[count y cs], x >= o]
    where
        cs = concat xs
isSubset a b = not $ any (`notElem` b) a
-- Count how many times the list y appears as a subset of a list of lists xs
countSubsets y xs = length $ filter (isSubset y ) xs
count :: Eq a => a -> [a] -> Int
count x [] = 0
count x (y:ys) | x == y = 1+(count x ys)
               | otherwise = count x ys
transactions =[["Butter", "Biscuits", "Cream", "Newspaper", "Bread", "Chocolate"],
               ["Cream", "Newspaper", "Tea", "Oil", "Chocolate"] ,
               ["Chocolate", "Cereal", "Bread"],
               ["Chocolate", "Flour", "Biscuits", "Newspaper"],
               ["Chocolate", "Biscuits", "Newspaper"] ]
But when I compile it, I get this error:
apriori.hs:5:51:
Occurs check: cannot construct the infinite type: a0 = [a0]
Expected type: [a0]
Actual type: [[a0]]
In the second argument of kItemSets', namely `ck'
In the second argument of `(++)', namely `(kItemSets' (k + 1) ck)'
Failed, modules loaded: none.
But when I run the following from ghci:
*Main> mapM_ print $ filter (\x->(countSubsets x transactions ) >= 2 ) $ combinations 2 $ frequentItems transactions 2
["Biscuits","Newspaper"]
["Biscuits","Chocolate"]
["Cream","Newspaper"]
["Cream","Chocolate"]
["Newspaper","Chocolate"]
["Bread","Chocolate"]
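This is correct, since these are the 2-itemsets that meet the occurrence threshold in the set of transactions. But what I need for the 3-itemsets is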
[["Biscuits", "Chocolate", "Newspaper" ],
["Chocolate", "Cream", "Newspaper"]]
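and for that to be appended to the list of 2-itemsets. How can I change my current code to achieve this? I know the 3-itemsets can be built from the 2-itemsets, but I don't know how to go about it.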
Answer 0 (score: 1)
I had to use this on line 5:
kItemSets' k t = ck ++ (kItemSets' (k+1) $ nub $ concat ck)
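This flattens ck (a list of itemsets) back into a list of distinct items, which is the type kItemSets' expects as its second argument. It is not the most efficient approach, since it regenerates every k-combination of those items, but it works.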
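For reference, here is a minimal self-contained sketch of the question's program with the one-line change above applied. A few things in it are my own assumptions rather than part of the original code: the explicit type signatures, the local helper name go (standing in for kItemSets'), a shortened combinations, and the base case returning [] instead of [[]] so that no empty itemset ends up at the end of the result.

```haskell
import Data.List (nub)

-- Sample data copied from the question.
transactions :: [[String]]
transactions =
  [ ["Butter", "Biscuits", "Cream", "Newspaper", "Bread", "Chocolate"]
  , ["Cream", "Newspaper", "Tea", "Oil", "Chocolate"]
  , ["Chocolate", "Cereal", "Bread"]
  , ["Chocolate", "Flour", "Biscuits", "Newspaper"]
  , ["Chocolate", "Biscuits", "Newspaper"]
  ]

-- True when every element of a also occurs in b.
isSubset :: Eq a => [a] -> [a] -> Bool
isSubset a b = all (`elem` b) a

-- Number of transactions in xs that contain every item of y.
countSubsets :: Eq a => [a] -> [[a]] -> Int
countSubsets y = length . filter (isSubset y)

-- Distinct items that occur at least o times across all transactions.
frequentItems :: Eq a => [[a]] -> Int -> [a]
frequentItems xs o = [y | y <- nub cs, length (filter (== y) cs) >= o]
  where
    cs = concat xs

-- All length-n combinations of a list, in list order.
combinations :: Int -> [a] -> [[a]]
combinations 0 _ = [[]]
combinations _ [] = []
combinations n (y:ys) = map (y :) (combinations (n - 1) ys) ++ combinations n ys

-- Frequent itemsets of size 2 and up, smaller sizes first.
kItemSets :: Eq a => [[a]] -> Int -> [[a]]
kItemSets d thresh = go 2 (frequentItems d thresh)
  where
    -- Assumption: stop with [] (the question's code returns [[]] here).
    go :: Int -> [a] -> [[a]]
    go _ [] = []
    -- The fix from the answer: pass the distinct items of ck, not ck itself.
    go k items = ck ++ go (k + 1) (nub (concat ck))
      where
        ck = filter (\x -> countSubsets x d >= thresh) (combinations k items)
```

In ghci, kItemSets transactions 2 gives the six 2-itemsets from the question followed by ["Biscuits","Newspaper","Chocolate"] and ["Newspaper","Chocolate","Cream"]. These are the two 3-itemsets the question asks for, with the items in a different order inside each set.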