Question

I am running Mahout 0.7 on my cluster of 30 nodes (each node has 8 cores and 16 GB of RAM), trying to canopy-cluster 250,000 SparseVectors (cardinality 300,000).

If I tune the canopy parameters (T1, T2) so that canopy clustering finds only a small number of canopy centers, it works very well.

Beyond a certain number of canopy centers, the job fails at 67% of the reduce phase with an "Error: Java heap space" message.

K-means clustering runs into the same heap-space problem as the value of K increases.

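I have heard that both the canopy-center vectors and the k-center vectors are held in memory in every mapper and reducer. That would be (number of canopy centers, or k) x one sparse vector (cardinality 300,000), which ought to fit in 4 GB of memory, and that does not seem too bad.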
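The estimate above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration under stated assumptions: it takes the worst case in which a center ends up fully dense (every one of the 300,000 dimensions populated), counts 8 bytes per double, and ignores JVM object overhead, hash-map overhead, and everything else the task keeps on its heap. The class and method names are hypothetical and not part of Mahout.

```java
// Rough upper bound on how many fully dense centers fit in a given heap.
// Assumptions (not from Mahout): 8 bytes per double, no per-object overhead.
class CenterMemoryEstimate {

    // Bytes needed for one center if all `cardinality` entries are non-zero.
    static long denseCenterBytes(int cardinality) {
        return (long) cardinality * Double.BYTES;
    }

    // How many such centers fit in `heapBytes` (integer division, rounds down).
    static long maxDenseCenters(long heapBytes, int cardinality) {
        return heapBytes / denseCenterBytes(cardinality);
    }

    public static void main(String[] args) {
        int cardinality = 300_000;              // vector size from the question
        long heapBytes = 4000L * 1024 * 1024;   // -Xmx4000m
        System.out.println("bytes per dense center: " + denseCenterBytes(cardinality));
        System.out.println("dense centers that fit: " + maxDenseCenters(heapBytes, cardinality));
    }
}
```

Under these assumptions a single dense center takes about 2.4 MB, so a 4000 MB heap holds fewer than two thousand of them even before any overhead is counted. Centers that stay sparse would be much smaller, so the real limit depends on how dense the centers become.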

Based on earlier questions here and elsewhere, I have turned every memory knob I could find:

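hadoop-env.sh: set all heap sizes to 16 GB on the namenode and 8 GB on the datanodes.
mapred-site.xml: added the mapred.{map,reduce}.child.java.opts properties and set their value to -Xmx4000m
mapred-site.xml: changed the mapred.tasktracker.{map,reduce}.tasks.maximum properties, reducing their value from 8 to 4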
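One way to check whether a setting such as -Xmx4000m actually reaches a child JVM is to print the maximum heap the JVM reports from inside that process. The following is a minimal, standalone sketch using only the standard library; the class name and the idea of calling it from a task are assumptions for illustration, not something Hadoop or Mahout provides.

```java
// Reports the maximum heap the current JVM will attempt to use.
// Hypothetical helper: the same call could be logged from a mapper or reducer.
class HeapCheck {

    // Maximum heap in megabytes as reported by the running JVM.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // The reported value is typically somewhat below the -Xmx setting.
        System.out.println("max heap (MB): " + maxHeapMb());
    }
}
```

If the value printed inside a failing reduce task is far below 4000, the child options are probably not being picked up by the job.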

The problem persists. I have been banging my head against this for too long. Does anyone have any suggestions?

The complete command and output are as follows:

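    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.canopy.CanopyDriver;
    import org.apache.mahout.common.HadoopUtil;
    import org.apache.mahout.common.distance.ManhattanDistanceMeasure;

    public class MrBicClusteringDriver {

        public static void main(String[] args) throws Exception {
            String ratingsPath = args[0];
            String outputPath = args[1];
            double t1 = Double.parseDouble(args[2]);
            double t2 = Double.parseDouble(args[3]);

            Configuration conf = new Configuration();

            // Remove any output left over from a previous run.
            HadoopUtil.delete(conf, new Path(outputPath));

            // runClustering = true, clusterClassificationThreshold = 0.0, runSequential = false
            CanopyDriver.run(conf, new Path(ratingsPath), new Path(outputPath),
                    new ManhattanDistanceMeasure(), t1, t2, true, 0.0, false);
        }
    }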

This is the error message I get:

Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing /MrBic/Output/SeedGeneration_predSample
at org.apache.mahout.clustering.canopy.CanopyDriver.buildClustersMR(CanopyDriver.java:363)
at org.apache.mahout.clustering.canopy.CanopyDriver.buildClusters(CanopyDriver.java:248)
at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:155)
at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:170)
at MrBicClusteringDriver.main(MrBicClusteringDriver.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

2013-06-12 10:56:00,825 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:560)
at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:275)
at org.apache.mahout.clustering.canopy.Canopy.<init>(Canopy.java:43)
at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:163)
at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:47)
at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:30)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

Mahout canopy clustering, K-means clustering: Java heap space - out of memory

0 answers: