Spark java.lang.StackOverflowError when fitting logistic regression on a large dataset

Time: 2017-09-22 15:38:40

Tags: apache-spark pyspark

I am trying to fit a logistic regression model to a dataset with 470 features and 10 million training instances. Here is a snippet of my code.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula

# use every column except classWeight as a feature
formula = RFormula(formula="label ~ .-classWeight")

# elastic-net regularization parameters (lambda and alpha)
bestregLambdaVal = 0.005
bestregAlphaVal = 0.01

lr = LogisticRegression(maxIter=1000, regParam=bestregLambdaVal,
                        elasticNetParam=bestregAlphaVal, weightCol="classWeight")
pipeLineLr = Pipeline(stages=[formula, lr])
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight', 'label']])

I have also created a checkpoint directory,

sc.setCheckpointDir('checkpoint/')

as suggested in: Spark gives a StackOverflowError when training using ALS

However, I still get an error. Here is a partial stack trace:

File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 108, in _fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 265, in _fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 262, in _fit_java
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o383361.fit.
: java.lang.StackOverflowError
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
    at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)

I should also note that the 470 feature columns were added to the Spark dataframe iteratively using withColumn().
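
For illustration, a minimal sketch of that pattern (the 'raw_' source columns and the cast are hypothetical, not from the original code); each withColumn() call extends the logical plan, so hundreds of calls build up a very deep lineage that the JVM eventually fails to serialize:

from pyspark.sql import functions as F

for colName in featureColumnNameList:
    # hypothetical transformation: derive each feature column from a raw column
    mySparkDataFrame = mySparkDataFrame.withColumn(
        colName, F.col('raw_' + colName).cast('double'))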

1 Answer:

Answer 0: (score: 0)

So the mistake I was making was that, when checkpointing the dataframe, I would only do:

mySparkDataFrame.checkpoint(eager=True)

whereas the right way is:

mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)

That is, checkpoint() returns a new checkpointed DataFrame with a truncated lineage rather than modifying the original in place, so the result has to be assigned back to the variable.

This is based on another question I asked here (and got answered).


Also, it is recommended to persist() the dataframe before checkpointing it, and to count() it after checkpointing.
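
Putting this together, a minimal sketch of the recommended sequence (the storage level is an assumption; the answer only says to persist(), checkpoint, then count()):

from pyspark import StorageLevel

sc.setCheckpointDir('checkpoint/')

# persist first so the checkpoint does not recompute the full lineage
mySparkDataFrame = mySparkDataFrame.persist(StorageLevel.MEMORY_AND_DISK)

# checkpoint returns a new DataFrame with a truncated lineage; assign it back
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)

# force materialization after the checkpoint
mySparkDataFrame.count()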