I am using FPGrowth from Spark MLlib to find frequent patterns. Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FPGrowthExample")
    val sc = new SparkContext(conf)
    // Each line of the input file is one transaction; items are separated by spaces.
    val data = sc.textFile("/user/text").map(s => s.trim.split(" ")).cache()
    val fpg = new FPGrowth().setMinSupport(0.005).setNumPartitions(10)
    val model = fpg.run(data)
    val output = model.freqItemsets.map(itemset =>
      itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    output.repartition(1).saveAsTextFile("/user/result")
    sc.stop()
  }
}
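As a sanity check, here is a minimal local sketch of the same pipeline (the FPGrowthLocalCheck object name and the sample transactions below are made up for illustration); tiny inputs like this complete without any problem:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthLocalCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("FPGrowthLocalCheck").setMaster("local[2]"))
    // Each element is one transaction ("doc"); items within a transaction must be unique.
    val transactions = sc.parallelize(Seq("a b c", "a b", "b c d"))
      .map(_.trim.split(" "))
      .cache()
    val model = new FPGrowth().setMinSupport(0.5).setNumPartitions(2).run(transactions)
    // Print each frequent itemset with its frequency, same format as the cluster job.
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }
    sc.stop()
  }
}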
When the text file has 800,000 lines (each line is treated as one doc/transaction), Spark throws a StackOverflowError. Here is the error:
java.lang.StackOverflowError
at java.lang.Exception.<init>(Exception.java:102)
at java.lang.ReflectiveOperationException.<init>(ReflectiveOperationException.java:89)
at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:137)
at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:135)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashTable$class.serializeTo(HashTable.scala:124)
at scala.collection.mutable.HashMap.serializeTo(HashMap.scala:39)
at scala.collection.mutable.HashMap.writeObject(HashMap.scala:135)
Here is my submit script:
/usr/local/webserver/spark-1.5.1-bin-2.6.0/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 30 --driver-memory 30g \
  --executor-memory 30g --executor-cores 10 \
  --conf spark.driver.maxResultSize=10g --class FPGrowthExample project.jar
I don't know how to fix this; it runs fine when the input has only 1,000 lines.
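From the stack trace, the overflow seems to happen while Java serialization recursively writes a mutable HashMap (presumably part of the FP-tree that FPGrowth builds). Would raising the JVM thread stack size help? Something along these lines is what I have in mind, though the -Xss16m value is only a guess and I have not verified it:

/usr/local/webserver/spark-1.5.1-bin-2.6.0/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 30 --driver-memory 30g \
  --executor-memory 30g --executor-cores 10 \
  --conf spark.driver.maxResultSize=10g \
  --conf spark.driver.extraJavaOptions=-Xss16m \
  --conf spark.executor.extraJavaOptions=-Xss16m \
  --class FPGrowthExample project.jar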