I want to extract association rules from a set of transactions using the following Spark-Scala code:
val fpg = new FPGrowth().setMinSupport(minSupport).setNumPartitions(10)
val model = fpg.run(transactions)
model.generateAssociationRules(minConfidence).collect()
However, there are more than 10K products, so extracting the rules for all combinations is computationally expensive, and I don't need them anyway. I only want to extract pair rules:
Product 1 ==> Product 2
Product 1 ==> Product 3
Product 3 ==> Product 1
I don't care about the other combinations, for example:
[Product 1] ==> [Product 2, Product 3]
[Product 3,Product 1] ==> Product 2
Is there a way to do this?
Thanks, Amir
Answer 0 (score: 4)
Assuming your transactions look more or less like this:
val transactions = sc.parallelize(Seq(
  Array("a", "b", "e"),
  Array("c", "b", "e", "f"),
  Array("a", "b", "c"),
  Array("c", "e", "f"),
  Array("d", "e", "f")
))
you can try generating the frequent itemsets manually and applying AssociationRules directly:
import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
val freqItemsets = transactions
  .flatMap(xs =>
    (xs.combinations(1) ++ xs.combinations(2)).map(x => (x.toList, 1L))
  )
  .reduceByKey(_ + _)
  .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

val ar = new AssociationRules()
  .setMinConfidence(0.8)

val results = ar.run(freqItemsets)
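To confirm that only pair rules come out, you can collect and inspect the results; a minimal usage sketch (the print format here is illustrative, not from the original answer):

```scala
// Every rule has a single-item antecedent and a single-item consequent,
// because freqItemsets only contains itemsets of size 1 and 2.
results.collect().foreach { rule =>
  println(s"${rule.antecedent.mkString(",")} => " +
          s"${rule.consequent.mkString(",")} (confidence: ${rule.confidence})")
}
```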
Notes:

- freqItemsets is built with a plain flatMap, so it contains every 1- and 2-itemset that occurs at all; no minimum-support filtering is applied at this stage.
- If freqItemsets is too large to handle, you can split its creation into a few steps to mimic the actual FP-growth: first generate the frequent 1-itemsets, filter the transactions down to frequent items, and only then generate the 2-itemsets.
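If freqItemsets grows too large, that staged idea can be sketched as follows; minCount, freq1, freqItems, pruned, and freq2 are illustrative names, and the absolute-count threshold is an assumption, not part of the original answer:

```scala
val minCount = 2L // hypothetical absolute support threshold

// Step 1: count each item once per transaction and keep the frequent ones.
val freq1 = transactions
  .flatMap(_.distinct.map(x => (x, 1L)))
  .reduceByKey(_ + _)
  .filter { case (_, cnt) => cnt >= minCount }

// Step 2: drop infrequent items from every transaction.
val freqItems = freq1.keys.collect().toSet
val pruned = transactions.map(_.distinct.filter(freqItems.contains))

// Step 3: count 2-itemsets over the pruned transactions only.
val freq2 = pruned
  .flatMap(_.sorted.combinations(2).map(xs => (xs.toList, 1L)))
  .reduceByKey(_ + _)
  .filter { case (_, cnt) => cnt >= minCount }
```

freq1 and freq2 can then be mapped to FreqItemset objects, as in the answer above, before calling AssociationRules.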