I am currently trying to use the random forest implementation in Mahout to classify data.
While I am able to classify a certain amount of data with a trained forest, I cannot classify a larger dataset (roughly twice the size) with the same classifier.
In fact, the classification done during the MapReduce job works fine and reports success. Unfortunately, when the analysis is computed afterwards, it always ends with an OutOfMemoryError, apparently caused by hitting the GC overhead limit. I already added the option

-Dmapred.child.java.opts="-Xmx20g -XX:-UseGCOverheadLimit"

to the call, but that did not help.
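For reference, a testforest call with this option might look like the sketch below; all paths are placeholders I made up, not taken from the original post:

    mahout testforest \
      -Dmapred.child.java.opts="-Xmx20g -XX:-UseGCOverheadLimit" \
      -i /user/me/test-data \
      -ds /user/me/dataset.info \
      -m /user/me/forest-model \
      -a -mr \
      -o /user/me/predictions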
I remember that with an earlier version of Mahout (I think it was 0.7), the testforest method could classify almost arbitrarily large datasets and output evaluation measures such as the confusion matrix. I am confused why the simplest step of the whole process fails with an error like this.
Is there an easy way to work around this problem?
Here is one of the logs:
15/05/25 13:58:26 INFO mapreduce.Job: map 97% reduce 0%
15/05/25 13:58:46 INFO mapreduce.Job: map 98% reduce 0%
15/05/25 13:59:43 INFO mapreduce.Job: map 99% reduce 0%
15/05/25 14:01:20 INFO mapreduce.Job: map 100% reduce 0%
15/05/25 14:02:11 INFO mapreduce.Job: Job job_1432549186261_0032 completed successfully
15/05/25 14:02:12 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=2202834240
FILE: Number of bytes written=3408230
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=580537741
HDFS: Number of bytes written=343262060
HDFS: Number of read operations=150
HDFS: Number of large read operations=0
HDFS: Number of write operations=60
Job Counters
Failed map tasks=9
Launched map tasks=39
Other local map tasks=19
Data-local map tasks=17
Rack-local map tasks=3
Total time spent by all maps in occupied slots (ms)=3387270
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3387270
Total vcore-seconds taken by all map tasks=3387270
Total megabyte-seconds taken by all map tasks=10405693440
Map-Reduce Framework
Map input records=16993025
Map output records=16993045
Input split bytes=4950
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=15935
CPU time spent (ms)=1798740
Physical memory (bytes) snapshot=27353509888
Virtual memory (bytes) snapshot=102048583680
Total committed heap usage (bytes)=58348666880
File Input Format Counters
Bytes Read=580532791
File Output Format Counters
Bytes Written=343262060
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.mahout.classifier.df.mapreduce.Classifier.parseOutput(Classifier.java:169)
at org.apache.mahout.classifier.df.mapreduce.Classifier.run(Classifier.java:130)
at org.apache.mahout.classifier.df.mapreduce.TestForest.mapreduce(TestForest.java:188)
at org.apache.mahout.classifier.df.mapreduce.TestForest.testForest(TestForest.java:174)
at org.apache.mahout.classifier.df.mapreduce.TestForest.run(TestForest.java:146)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.classifier.df.mapreduce.TestForest.main(TestForest.java:315)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Answer 0 (score: 1)
OK, I found a solution, although it is not clear to me why it works:
Adding

export HADOOP_CLIENT_OPTS="-Xmx20192m"

to the script did the trick.
Using MAHOUT_HEAPSIZE=40000 or -Dmapred.child.java.opts did not help.
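A plausible explanation, judging from the stack trace above rather than from the original answer: the OutOfMemoryError is thrown in Classifier.parseOutput in the "main" thread, i.e. in the client/driver JVM that collects the results after the job has finished. HADOOP_CLIENT_OPTS sizes exactly that JVM, whereas mapred.child.java.opts only affects the map and reduce task JVMs. If you prefer not to edit the script, the variable can also be set for a single run, e.g. (placeholder paths again):

    HADOOP_CLIENT_OPTS="-Xmx20192m" mahout testforest -i /user/me/test-data \
      -ds /user/me/dataset.info -m /user/me/forest-model -a -mr -o /user/me/predictions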
I found the inspiration for the solution here: https://community.cloudera.com/t5/Data-Science-and-Machine/Java-heap-size-running-mahout-clusterdump/td-p/7752
If you run into a similar problem, you may be interested in the variables that can be set in the mahout launcher script, which is also available online:
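For orientation only, the heap-related section of the bin/mahout script looks roughly like this; this is a sketch from memory, not the verbatim script:

    # default maximum heap for the client JVM started by bin/mahout
    JAVA_HEAP_MAX=-Xmx4g

    # environment variable that overrides the default (value in MB)
    if [ "$MAHOUT_HEAPSIZE" != "" ]; then
      JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
    fi

Note that when Mahout runs against a Hadoop cluster, the client JVM is launched through bin/hadoop, which appends HADOOP_CLIENT_OPTS to its JVM options; presumably this is why that variable, and not MAHOUT_HEAPSIZE, was the one that helped here.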