我创建了一个Spark Scala项目来测试XGBoost4J-Spark。该项目成功构建,但是在运行脚本时出现此错误:
Message: <console>:65: error: object dmlc is not a member of package org.apache.spark.ml
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
我看到了类似这样的问题,似乎在mlm.xml文件中添加了mllib。但是,我已经有了依赖项,并且它引发了错误。请告知。
标量代码:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
val spark = SparkSession.builder().config("spark.jars","target/accounts-by-state-1.0.jar").getOrCreate()
val schema = new StructType(Array(
StructField("sepal length", DoubleType, true),
StructField("sepal width", DoubleType, true),
StructField("petal length", DoubleType, true),
StructField("petal width", DoubleType, true),
StructField("class", StringType, true)))
val rawInput = spark.read.schema(schema).csv("iris.data")
import org.apache.spark.ml.feature.StringIndexer
val stringIndexer = new StringIndexer().
setInputCol("class").
setOutputCol("classIndex").
fit(rawInput)
val labelTransformed = stringIndexer.transform(rawInput).drop("class")
import org.apache.spark.ml.feature.VectorAssembler
val vectorAssembler = new VectorAssembler().
setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
setOutputCol("features")
val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val xgbParam = Map("eta" -> 0.1f,
"missing" -> -999,
"objective" -> "multi:softprob",
"num_class" -> 3,
"num_round" -> 100,
"num_workers" -> 2)
val xgbClassifier = new XGBoostClassifier(xgbParam).
setFeaturesCol("features").
setLabelCol("classIndex")
Message: <console>:65: error: object dmlc is not a member of package org.apache.spark.ml
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
pom.xml文件内容:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.cloudera.training.devsh</groupId>
<artifactId>accounts-by-state</artifactId>
<version>1.0</version>
<packaging>jar</packaging>
<name>"Accounts by State"</name>
<properties>
<hadoop.version>3.0.0</hadoop.version>
<spark.version>2.4.0</spark.version>
<scala.version>2.11.12</scala.version>
<scala.binary.version>2.11</scala.binary.version>
<java.version>1.8</java.version>
</properties>
<repositories>
<repository>
<id>XGBoost4J Snapshot Repo</id>
<name>XGBoost4J Snapshot Repo</name>
<url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/</url>
</repository>
<repository>
<id>apache-repo</id>
<name>Apache Repository</name>
<url>https://repository.apache.org/content/repositories/releases</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<repository>
<id>cloudera-repo-releases</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>
</repositories>
<dependencies>
<dependency> <!-- Scala -->
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency> <!-- Core Spark -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency> <!-- Spark SQL -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
<scope>runtime</scope>
</dependency>
<dependency> <!-- Hadoop -->
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j_${scala.binary.version}</artifactId>
<version>1.1.0-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
<version>1.1.0-SNAPSHOT</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.3</version>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<args>
<!-- work-around for https://issues.scala-lang.org/browse/SI-8358 -->
<arg>-nobootcp</arg>
</args>
</configuration>
</plugin>
</plugins>
</build>
</project>
编辑:
为了更详细地说明,我在/spark-application/xgboost-project/src/main/scala.xgboost/XgBoost.scala中创建了一个名为XgBoost.scala的文件
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types._
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier._
//{StringType, LongType, StructField, StructType, DoubleType}
object XgBoost {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: XGBOOST TEST")
System.exit(1)
}
val stateCode = args(0)
val spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
val schema = new StructType(Array(
StructField("sepal length", DoubleType, true),
StructField("sepal width", DoubleType, true),
StructField("petal length", DoubleType, true),
StructField("petal width", DoubleType, true),
StructField("class", StringType, true)))
val rawInput = spark.read.schema(schema).csv("~/iris.data")
val stringIndexer = new StringIndexer().
setInputCol("class").
setOutputCol("classIndex").
fit(rawInput)
val labelTransformed = stringIndexer.transform(rawInput).drop("class")
val vectorAssembler = new VectorAssembler().
setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
setOutputCol("features")
val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")
val xgbParam = Map("eta" -> 0.1f,
"missing" -> -999,
"objective" -> "multi:softprob",
"num_class" -> 3,
"num_round" -> 100,
"num_workers" -> 2)
val xgbClassifier = new XGBoostClassifier(xgbParam).
setFeaturesCol("features").
setLabelCol("classIndex")
spark.stop
}
}
然后我将pom.xml文件修改如下:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.cloudera.xgboost</groupId>
<artifactId>xgboost-project</artifactId>
<version>1.0</version>
<packaging>jar</packaging>
<name>"XGBoost Project"</name>
<properties>
<hadoop.version>3.0.0</hadoop.version>
<spark.version>2.4.0</spark.version>
<scala.version>2.11.12</scala.version>
<scala.binary.version>2.11</scala.binary.version>
<java.version>1.8</java.version>
</properties>
<repositories>
<repository>
<id>XGBoost4J Snapshot Repo</id>
<name>XGBoost4J Snapshot Repo</name>
<url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/</url>
</repository>
<repository>
<id>apache-repo</id>
<name>Apache Repository</name>
<url>https://repository.apache.org/content/repositories/releases</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<repository>
<id>cloudera-repo-releases</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>
</repositories>
<dependencies>
<dependency> <!-- Scala -->
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency> <!-- Core Spark -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency> <!-- Spark SQL -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency> <!-- Hadoop -->
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j_${scala.binary.version}</artifactId>
<version>1.1.0-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
<version>1.1.0-SNAPSHOT</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>${app.main.class}</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>${app.main.class}</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<args>
<!-- work-around for https://issues.scala-lang.org/browse/SI-8358 -->
<arg>-nobootcp</arg>
</args>
</configuration>
</plugin>
</plugins>
</build>
</project>
当我在xgboost-project目录中提交“ mvn软件包”时,我得到:
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.cloudera.xgboost:xgboost-project:jar:1.0
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 89, column 17
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] ----------------< com.cloudera.xgboost:xgboost-project >----------------
[INFO] Building "XGBoost Project" 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ xgboost-project ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/exercises/spark-application/xgboost-project/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ xgboost-project ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:compile (default) @ xgboost-project ---
[WARNING] Expected all dependencies to require Scala version: 2.11.12
[WARNING] com.cloudera.xgboost:xgboost-project:1.0 requires scala version: 2.11.12
[WARNING] com.twitter:chill_2.11:0.9.3 requires scala version: 2.11.12
[WARNING] org.apache.spark:spark-core_2.11:2.4.0 requires scala version: 2.11.12
[WARNING] org.json4s:json4s-jackson_2.11:3.5.3 requires scala version: 2.11.11
[WARNING] Multiple versions of scala libraries detected!
[INFO] /home/exercises/spark-application/xgboost-project/src/main/scala:-1: info: compiling
[INFO] Compiling 1 source files to /home/cdsw/exercises/spark-application/xgboost-project/target/classes at 1599664489439
[ERROR] /home/exercises/spark-application/xgboost-project/src/main/scala/xgboost/XgBoost.scala:48: error: not found: type XGBoostClassifier
[ERROR] val xgbClassifier = new XGBoostClassifier(xgbParam).
[ERROR] ^
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:08 min
[INFO] Finished at: 2020-09-09T15:14:53Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on project xgboost-project:wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
这可能是我的pom.xml文件有问题吗?现在它还在抱怨多个Scala版本...但是主要是它似乎无法识别XGBoostClassifier对象...
答案 0 :(得分:1)
提交作业时,您需要提供XGBoost库-最简单的方法是通过--packages
的{{1}}标志指定Maven坐标,如下所示:
spark-submit
另一种可能性是创建一个具有所有依赖项的胖子-例如,通过使用汇编或分片插件,例如,将以下内容添加到pom.xml的spark-submit --packages ml.dmlc:xgboost4j-spark_2.11:1.1.0-SNAPSHOT other options
部分:
plugins
,然后使用生成的 <plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.2.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
文件执行
更新:这似乎是由于使用...jar-with-dependencies.jar
而不是import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier._
here是代码的可编译版本。一个问题是-您真的需要使用1.1.0-SNAPSHOT而不是稳定的1.0.0吗?