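When I run PCA in pyspark, I run out of memory. This is pyspark 1.6.3, and the execution environment is a Zeppelin notebook. Here is an example. Let df
be a pyspark DataFrame in which 'vectors' is the desired input column (containing SparseVector data).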
from pyspark.ml.feature import PCA
pca = PCA(k = 100, inputCol="vectors", outputCol = "pca").fit(df)
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2419389767585347468.py", line 360, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 2, in <module>
  File "/usr/hdp/current/spark-client/python/pyspark/ml/pipeline.py", line 69, in fit
    return self._fit(dataset)
  File "/usr/hdp/current/spark-client/python/pyspark/ml/wrapper.py", line 133, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/hdp/current/spark-client/python/pyspark/ml/wrapper.py", line 130, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o222.fit.
: java.lang.OutOfMemoryError: Java heap space
But consider this:
import pandas as pd
import numpy as np
pandf = df.toPandas()
densevectors = [np.array(sparse.toArray()) for sparse in pandf['vectors']]
xtrain = np.vstack(densevectors)
from sklearn.decomposition import PCA as skPCA
skpca = skPCA(n_components=100).fit(xtrain)
skpca.components_.shape
(100, 41277)
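This takes 14 seconds to run and, of course, has no memory problems, since the input dataset is only ~9000 rows of sparse vectors. In spark-defaults.conf, driver and executor memory are both set to 12g, and this is an 8-node cluster in which each node should have 32g available. The entire input dataset takes up less than 1 MB, even in .csv format.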
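For scale, here is a rough back-of-the-envelope sketch of how large dense float64 arrays with these dimensions would be. The row count (~9000) and feature count (41277) come from the numbers above; the helper `dense_gib` is a hypothetical name introduced only for this illustration. Whether Spark 1.6's PCA actually materializes a dense d x d matrix on the driver is an assumption here, not something the traceback confirms, but it would be one place where 12g could plausibly fall short.

```python
# Rough memory estimate for dense float64 arrays (illustrative only).
BYTES_PER_DOUBLE = 8


def dense_gib(rows, cols):
    """Return the size in GiB of a dense rows x cols float64 matrix."""
    return rows * cols * BYTES_PER_DOUBLE / float(2 ** 30)


n_rows = 9000        # approximate row count of the input dataset
n_features = 41277   # second dimension of skpca.components_.shape

# Size of the densified input, as built by np.vstack in the sklearn path.
data_gib = dense_gib(n_rows, n_features)

# Size of a dense d x d matrix (e.g. a covariance matrix), ASSUMING
# the implementation builds one in a single JVM heap.
cov_gib = dense_gib(n_features, n_features)

print("dense n x d data matrix: %.2f GiB" % data_gib)
print("dense d x d matrix:      %.2f GiB" % cov_gib)
```

Under that assumption, the d x d matrix alone comes to roughly 12.7 GiB, which is already above a 12g heap, while the densified input used by sklearn is under 3 GiB.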
Why does pyspark's PCA implementation run out of memory?