I am trying to run LDA on a very small dataset of ~1000 documents. The LDA runs fine and I am also able to save the model.
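For context, here is a minimal sketch of what my program does (the input path, K=10, and the document parsing are placeholders, not my exact code; I am on the MLlib RDD API):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

object LDASave {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LDASave").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Placeholder corpus: ~1000 documents as term-count vectors, paired with document IDs.
    val docs = sc.textFile("/tmp/docs.txt")
      .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
      .zipWithIndex()
      .map { case (v, id) => (id, v) }
      .cache()

    val lDAModel = new LDA().setK(10).run(docs)

    // This is the call that triggers the OutOfMemoryError at shutdown.
    lDAModel.save(sc, "/tmp/tempLDA.model")

    sc.stop()
  }
}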
If I run the program without lDAModel.save(), I get the following at the end:
16/03/13 14:26:52 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:53759
16/03/13 14:26:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/03/13 14:26:52 INFO MemoryStore: MemoryStore cleared
16/03/13 14:26:52 INFO BlockManager: BlockManager stopped
16/03/13 14:26:52 INFO BlockManagerMaster: BlockManagerMaster stopped
16/03/13 14:26:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/03/13 14:26:52 INFO SparkContext: Successfully stopped SparkContext
16/03/13 14:26:52 INFO ShutdownHookManager: Shutdown hook called
16/03/13 14:26:52 INFO ShutdownHookManager: Deleting directory /tmp/spark-753c7923-b623-45a7-afd1-5738766d7571
But if I save the model, I get the following at the end of the output:
16/03/13 14:44:01 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/03/13 14:44:01 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/03/13 14:44:01 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/03/13 14:44:01 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/03/13 14:44:01 INFO FileOutputCommitter: Saved output of task 'attempt_201603131444_0041_m_000000_151' to file:/tmp/tempLDA.model/metadata/_temporary/0/task_201603131444_0041_m_000000
16/03/13 14:44:01 INFO SparkHadoopMapRedUtil: attempt_201603131444_0041_m_000000_151: Committed
16/03/13 14:44:01 INFO Executor: Finished task 0.0 in stage 41.0 (TID 151). 873 bytes result sent to driver
16/03/13 14:44:01 INFO TaskSetManager: Finished task 0.0 in stage 41.0 (TID 151) in 85 ms on localhost (1/1)
16/03/13 14:44:01 INFO TaskSchedulerImpl: Removed TaskSet 41.0, whose tasks have all completed, from pool
16/03/13 14:44:01 INFO DAGScheduler: ResultStage 41 (saveAsTextFile at LDAModel.scala:433) finished in 0.085 s
16/03/13 14:44:01 INFO DAGScheduler: Job 39 finished: saveAsTextFile at LDAModel.scala:433, took 0.116725 s
16/03/13 14:44:01 INFO BlockManagerInfo: Removed broadcast_53_piece0 on localhost:44879 in memory (size: 16.4 KB, free: 1087.1 MB)
16/03/13 14:44:01 INFO ContextCleaner: Cleaned accumulator 44
Exception in thread "main" 16/03/13 14:44:02 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:44879 in memory (size: 10.0 KB, free: 1087.1 MB)
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
Process finished with exit code 1
The model is saved in the second case, but there is still an OutOfMemoryError at the end of the program.
What should I do to correct this?
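For reference, the BlockManagerInfo lines above suggest the driver has roughly 1 GB of storage memory (free: 1087.1 MB). I assume one option would be to raise the driver memory when submitting, along these lines (--driver-memory is a standard spark-submit flag; the class and jar names here are placeholders):

spark-submit --driver-memory 4g --class com.example.LDASave target/myapp.jar

But I don't understand why saving such a small model would need more memory in the first place.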