I am running a logistic regression in SparkR with the following code:
SPARK_HOME <- "C:\\Users\\Softwares\\spark-1.6.0-bin"
Sys.setenv(SPARK_HOME = "C:\\Users\\Softwares\\spark-1.6.0-bin")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.2.0" "sparkr-shell"')
Sys.setenv(HADOOP_CONF = "C:\\Users\\Softwares\\hadoop-2.6.0\\bin")
Sys.setenv(HADOOP_HOME = "C:\\Users\\Softwares\\hadoop-2.6.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local[4]", sparkHome = SPARK_HOME, sparkEnvir = list(spark.driver.memory="512m"))
sqlContext <- sparkRSQL.init(sc)
df1 <- read.df(sqlContext, "flattened-path1.csv", source = "com.databricks.spark.csv")
clogit <- glm(C13 ~ C1 + C2 + C3 + C4 + C5 + C6 + C7 + C8 + C9 + C10 + C11 + C12,
              data = df1, family = "binomial")
I get the following exception:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 5 in the input dataset.
at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:290)
at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:144)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:140)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:140)
The target variable C13 is an integer field with 4 unique values. If that is the problem, is there a way to cast this field to a factor or anything else that would make it work?
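One thing I considered was casting the label column explicitly before fitting, along these lines (an untested sketch using the SparkR 1.6 `cast` column function; I am not sure this changes what the ML pipeline sees):

```r
# Sketch: cast the label column to double before calling glm
df1$C13 <- cast(df1$C13, "double")
```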
UPDATE: I trimmed my dataset and removed two of the unique values, so that only two values (0 and 1) remain. Surprisingly, I get almost the same error message:
Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 3 in the input dataset
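To check which labels Spark actually sees in the column (as opposed to the two values I expect), something like this should work with the SparkR 1.6 DataFrame API (untested sketch):

```r
# Sketch: list the distinct values of the label column as Spark sees them
labels <- collect(distinct(select(df1, "C13")))
print(labels)
```

Note that my `read.df` call does not set any spark-csv options such as `header`, so I suspect the extra value the error reports may not be a real label in my data.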