The dataset is fairly large: 30 columns and 200,000 records. I am building a GLM model with SparkR, but fitting the model takes a very long time and also throws an error. How can I reduce the model-building time in SparkR and fix the error shown below? Please give me suggestions to improve this code.
R code:
# Set SPARK_HOME
Sys.setenv(SPARK_HOME="C:/spark/spark-2.0.0-bin-hadoop2.7")
# Set the library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.7.0_71")
# Load the SparkR library
library(SparkR)
library(rJava)
sc <- sparkR.session(enableHiveSupport = FALSE, master = "local[*]", appName = "SparkR-Modi",
                     sparkConfig = list(spark.sql.warehouse.dir = "file:///c:/tmp/spark-warehouse"))
sqlContext <- sparkRSQL.init(sc)
spdf <- read.df(sqlContext, "C:/Users/prasann/Desktop/V/bigdata11.csv",
                source = "com.databricks.spark.csv", header = "true")
showDF(spdf)
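One thing I am not sure about: with only header = "true", spark-csv seems to read every column as a string, so the formula would treat all 30 columns as categorical. Would asking spark-csv to infer the schema and caching the DataFrame help with the slow fit? A rough, untested sketch of what I mean:

# Same read as above, but letting spark-csv infer numeric types instead of
# treating every column as a string (inferSchema is a spark-csv option)
spdf <- read.df(sqlContext, "C:/Users/prasann/Desktop/V/bigdata11.csv",
                source = "com.databricks.spark.csv",
                header = "true", inferSchema = "true")
printSchema(spdf)   # check that numeric columns are no longer strings
cache(spdf)         # keep the data in memory across the iterative fit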
# GLM model
md <- glm(NP_OfferCurrentResponse ~., family = "binomial", data = spdf)
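I was also planning to try the SparkR-native spark.glm wrapper (rough sketch below, untested), since it exposes maxIter and tol directly, but the plain glm call already fails with the error shown further down.

# Same model through spark.glm; my understanding is that glm() on a
# SparkDataFrame dispatches to this anyway, but here maxIter can be lowered
md <- spark.glm(spdf, NP_OfferCurrentResponse ~ ., family = "binomial", maxIter = 10)
summary(md)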
Error (the model runs very slowly and then fails):
> md <- glm(NP_OfferCurrentResponse ~., family = "binomial", data = spdf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.AssertionError: assertion failed: lapack.dppsv returned 226.
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:140)
at org.apache.spark.ml.regression.GeneralizedLinearRegression$FamilyAndLink.initialize(GeneralizedLinearRegression.scala:340)
at org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:275)
at org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:139)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:145)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.c
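From the stack trace, my (possibly wrong) reading is that the failure happens in the Cholesky solve inside WeightedLeastSquares: lapack.dppsv reports that the normal-equations matrix is not positive definite, which would point to constant or collinear columns once the formula expands the 30 columns. Would a pre-check like the sketch below (slow and untested, just my idea) be a reasonable way to find and drop constant columns before fitting?

# Hypothetical pre-check: drop columns with fewer than two distinct values,
# since constant columns make the design matrix rank-deficient
constCols <- Filter(function(colName) {
  count(distinct(select(spdf, colName))) < 2
}, columns(spdf))
spdf2 <- select(spdf, setdiff(columns(spdf), constCols))
md <- glm(NP_OfferCurrentResponse ~ ., family = "binomial", data = spdf2)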