Building a GLM model with SparkR on Windows is very slow and fails while executing the R code

Asked: 2016-10-19 04:07:31

Tags: r apache-spark glm sparkr

The dataset is large: 30 columns and 200,000 records. I am building a GLM model with SparkR, but fitting the model takes a very long time and also throws an error. How can I reduce the model-building time with SparkR and fix the error shown below? Please suggest how to improve this code.

R code: set SPARK_HOME

Sys.setenv(SPARK_HOME="C:/spark/spark-2.0.0-bin-hadoop2.7")

Set the library paths

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))

Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.7.0_71")

Load the SparkR library

library(SparkR)
library(rJava)

sc <- sparkR.session(enableHiveSupport = FALSE, master = "local[*]",
                     appName = "SparkR-Modi",
                     sparkConfig = list(spark.sql.warehouse.dir = "file:///c:/tmp/spark-warehouse"))
sqlContext <- sparkRSQL.init(sc)
spdf <- read.df(sqlContext, "C:/Users/prasann/Desktop/V/bigdata11.csv",
                source = "com.databricks.spark.csv", header = "true")
showDF(spdf)
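
Note that read.df without a schema (and without inferSchema) reads every CSV column as a string, so the ~ . formula below one-hot encodes all 30 columns; that both slows the fit and can produce collinear dummy columns. A sketch using Spark 2.0's built-in csv source (which replaces the external com.databricks.spark.csv package), with schema inference and caching:

# Sketch for Spark 2.0+: infer numeric column types instead of reading
# everything as strings, and cache the data so the iterative GLM fit
# does not re-parse the CSV on every pass.
spdf <- read.df("C:/Users/prasann/Desktop/V/bigdata11.csv",
                source = "csv", header = "true", inferSchema = "true")
cache(spdf)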

GLM model

md <- glm(NP_OfferCurrentResponse ~., family = "binomial", data = spdf)
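
For reference, glm here is SparkR's mask of stats::glm and delegates to spark.glm, SparkR 2.0's native entry point. An equivalent call, with the family given as an R family object, would be:

md <- spark.glm(spdf, NP_OfferCurrentResponse ~ ., family = binomial(link = "logit"))
summary(md)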

Error (the model fit is very slow and then fails):

> md <- glm(NP_OfferCurrentResponse ~., family = "binomial", data = spdf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
java.lang.AssertionError: assertion failed: lapack.dppsv returned 226.
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:140)
at org.apache.spark.ml.regression.GeneralizedLinearRegression$FamilyAndLink.initialize(GeneralizedLinearRegression.scala:340)
at org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:275)
at org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:139)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:145)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.c
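
For context on the assertion: LAPACK's dppsv returns a positive code i when the leading minor of order i of the matrix is not positive definite, so info 226 means the normal-equations matrix Spark builds for the weighted least squares step is singular or nearly so. Typical causes are constant columns, duplicated or perfectly collinear predictors, or high-cardinality string columns exploded into dummy variables by the ~ . formula. A diagnostic sketch (it assumes the predictors have already been cast to numeric types; variance and agg are SparkR aggregation functions):

# Sketch: drop zero-variance predictors, which make the normal equations
# singular and trigger the lapack.dppsv assertion, then refit.
predictors <- setdiff(colnames(spdf), "NP_OfferCurrentResponse")
vars <- sapply(predictors, function(col) {
  collect(agg(spdf, v = variance(spdf[[col]])))$v
})
keep <- predictors[!is.na(vars) & vars > 0]
f <- as.formula(paste("NP_OfferCurrentResponse ~",
                      paste(keep, collapse = " + ")))
md <- glm(f, family = "binomial", data = spdf)

If collinearity persists after dropping such columns, later Spark releases expose a regParam argument on spark.glm that adds ridge regularization and keeps the system positive definite.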

0 Answers:

There are no answers yet.