I am trying to extract some association rules from this dataset:
49
70
27,66
6
27
66,8,64
32
82
66
71
44
1
33
17
31,83
50,29
22
72
8
8,16
56
83,61
85,63,37
50,57
2
50
96,6
73
57
12
62
96
3
47,50,73
35
85,45
25,96,22,17
85
24
17,57
34,4
60,96,45
25
85,66,73
30
14
73,85
64
48
5
37
13,55
37,17
I have this code:
val transactions = sc.textFile("/user/cloudera/dataset1")
import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
val freqItemsets = transactions.flatMap(xs =>
(xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
val ar = new AssociationRules().setMinConfidence(0.4)
val results = ar.run(freqItemsets)
results.collect().foreach { rule =>
println("[" + rule.antecedent.mkString(",")
+ "=>"
+ rule.consequent.mkString(",") + "]," + rule.confidence)
}
But my output contains some unexpected lines:
[2,9=>5],0.5
[8,5,,,3=>6],1.0
[8,5,,,3=>7],0.5
[8,5,,,3=>7],0.5
[,,,=>6],0.5
[,,,=>7],0.5
[,,,=>5],0.5
[,,,=>3],0.5
[4,3=>7],1.0
[4,3=>,,,],1.0
[4,3=>,,,],1.0
[4,3=>5],1.0
[4,3=>7,7],1.0
[4,3=>7,7],1.0
[4,3=>0],1.0
Why am I getting output like this:
[,,,=>3],0.5
I don't understand what is going wrong... Does anyone know how to fix this?
Thanks a lot!
Answer 0 (score: 0)
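All of these results should be unexpected, because there is a bug in your code!
You need to create combinations of items. As written, your code creates combinations of the characters in each line's string (e.g. "25,96,22,17"), which of course does not give the correct result (and is why you see "," as an element).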
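To make the difference concrete, here is a minimal sketch in plain Scala (no Spark needed) that compares the two behaviours on one line of the dataset. The object and method names (`CombinationsDemo`, `charCombos`, `itemCombos`) are made up for illustration and are not part of the original code.

```scala
object CombinationsDemo {
  // What the original code effectively does: combinations over the
  // characters of the raw line, so ',' is treated as an item too.
  def charCombos(line: String, k: Int): List[String] =
    line.combinations(k).map(_.mkString).toList

  // What is intended: split the line into items first, then combine.
  def itemCombos(line: String, k: Int): List[List[String]] =
    line.split(",").combinations(k).map(_.toList).toList

  def main(args: Array[String]): Unit = {
    val line = "27,66" // one transaction from the dataset
    println(charCombos(line, 2)) // some of these "itemsets" contain ','
    println(itemCombos(line, 2)) // List(List(27, 66))
  }
}
```

The character-level version is where single-digit "items" and the "," entries in the unexpected output come from.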
To fix this, add: val freqItemsets = transactions.map(_.split(",")).
So instead of
val freqItemsets = transactions.flatMap(xs =>
(xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
you have:
val freqItemsets = transactions.map(_.split(",")).flatMap(xs =>
(xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).filter(_.nonEmpty).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
This gives the expected results:
[96,17=>22],1.0
[96,17=>25],1.0
[85,37=>63],1.0
[47,73=>50],1.0
[31=>83],1.0
[60,45=>96],1.0
[60=>45],1.0
[60=>96],1.0
[96,45=>60],1.0
[22,17=>25],1.0
[22,17=>96],1.0
[66,8=>64],1.0
[63,37=>85],1.0
[66,64=>8],1.0
[25,22,17=>96],1.0
[27=>66],0.5
[96,22,17=>25],1.0
[61=>83],1.0
[64=>66],0.5
[64=>8],0.5
[45=>60],0.5
[45=>96],0.5
[45=>85],0.5
[6=>96],0.5
[47=>73],1.0
[47=>50],1.0
[50,73=>47],1.0
[96,22=>17],1.0
[96,22=>25],1.0
[66,73=>85],1.0
[8,64=>66],1.0
[29=>50],1.0
[83=>31],0.5
[83=>61],0.5
[25,96,17=>22],1.0
[85,66=>73],1.0
[25,96,22=>17],1.0
[25,96=>17],1.0
[25,96=>22],1.0
[22=>17],0.5
[22=>96],0.5
[22=>25],0.5
[85,73=>66],1.0
[55=>13],1.0
[60,96=>45],1.0
[63=>37],1.0
[63=>85],1.0
[25,22=>17],1.0
[25,22=>96],1.0
[16=>8],1.0
[25=>96],0.5
[25=>22],0.5
[25=>17],0.5
[34=>4],1.0
[85,63=>37],1.0
[47,50=>73],1.0
[13=>55],1.0
[4=>34],1.0
[25,17=>22],1.0
[25,17=>96],1.0
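As a sanity check on one of these rules, the following plain-Scala sketch recomputes the confidence of [27=>66] by hand. It assumes that confidence is the number of transactions containing both the antecedent and the consequent, divided by the number containing the antecedent. The helper names (`ConfidenceDemo`, `support`, `confidence`) are hypothetical, and only the five dataset lines that contain 27 or 66 are used.

```scala
object ConfidenceDemo {
  // Number of transactions that contain every item of `itemset`.
  def support(transactions: Seq[Set[String]], itemset: Set[String]): Long =
    transactions.count(t => itemset.subsetOf(t)).toLong

  // Assumed definition: support(antecedent ++ consequent) / support(antecedent).
  def confidence(transactions: Seq[Set[String]],
                 antecedent: Set[String],
                 consequent: Set[String]): Double =
    support(transactions, antecedent ++ consequent).toDouble /
      support(transactions, antecedent)

  def main(args: Array[String]): Unit = {
    // The dataset lines that mention item 27 or item 66.
    val transactions: Seq[Set[String]] =
      Seq("27,66", "27", "66,8,64", "66", "85,66,73").map(_.split(",").toSet)

    // 27 occurs in two transactions, and one of them also contains 66.
    println(confidence(transactions, Set("27"), Set("66"))) // 0.5
  }
}
```

This matches the [27=>66],0.5 line in the output above.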