I am trying to run the example PySpark PCA code from https://spark.apache.org/docs/2.2.0/ml-features.html#pca.
My DataFrame has 5,000,000 records and 23,000 features. After running the PCA code, I get the following error:
Py4JJavaError: An error occurred while calling o908.fit.
: java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:793)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1137)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:122)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:344)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:387)
at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:99)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:70)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
The Spark version is 2.2, I run Spark on YARN, and the Spark parameters are:
spark.executor.memory=32G
spark.driver.memory=32G
spark.driver.maxResultSize=32G
Should I reduce the number of features before running PCA, or is there some other solution?
Answer 0 (score: 1)
I suspect you can run this with a different configuration. How many executors do you have? If you have 100 executors, each allocated 32GB of memory, on a cluster with 1TB of total memory, you will quickly run out, because the executors together try to claim 3.2TB of memory that does not exist. On the other hand, if you are running a single executor, 32GB is probably not enough for the task. You may find that 20 executors with 8GB each gives you enough room to complete the job (though it may be slow).
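The arithmetic behind that advice can be sketched directly (the 1TB cluster size and executor counts are the hypothetical figures from the paragraph above, not the asker's actual cluster):

```python
# Hypothetical cluster: 1 TB total memory
cluster_gb = 1024

# 100 executors at 32 GB each over-subscribes the cluster
requested_gb = 100 * 32
print(requested_gb > cluster_gb)  # True: 3200 GB requested, only 1024 GB exists

# The suggested alternative: 20 executors at 8 GB each fits comfortably
alt_gb = 20 * 8
print(alt_gb <= cluster_gb)       # True: 160 GB requested
```

The point is that `spark.executor.memory` is a per-executor request, so the total footprint is that value multiplied by the executor count, and YARN will not grant what the cluster does not have.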
When a DataFrame causes problems in an ML pipeline, I usually troubleshoot in the following order:
1) Test the method on a tiny DataFrame: 10 features and 1,000 rows. To avoid lineage problems, reduce the sample at the source, either with a `limit` clause in SQL or by passing in a smaller CSV. If the method does not work on the small frame, the memory problem is probably secondary.
2) If the method fails even on the smaller DataFrame, start investigating the data itself. Are all of your features numeric? Do any of them contain nulls? Non-numeric or null values in the features can break the PCA routine (though not necessarily with an OutOfMemory error).
3) If the data is well formed and your code is correct, start scaling up, and make sure to watch stderr and stdout on your nodes as you go. To reach your nodes you should have some utility (for example, the Cloudera distribution of Hadoop includes Cloudera Manager, which lets you drill from Jobs to Stages down to individual tasks to find the stderr).