XGBoost4J-Spark错误-对象dmlc不是软件包org.apache.spark.ml的成员

时间:2020-09-07 03:45:33

标签: scala maven apache-spark apache-spark-mllib xgboost

我创建了一个Spark Scala项目来测试XGBoost4J-Spark。该项目成功构建,但是在运行脚本时出现此错误:

Message: <console>:65: error: object dmlc is not a member of package org.apache.spark.ml
       import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

我看到了类似这样的问题,似乎在mlm.xml文件中添加了mllib。但是,我已经有了依赖项,并且它引发了错误。请告知。

标量代码:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().config("spark.jars","target/accounts-by-state-1.0.jar").getOrCreate()
val schema = new StructType(Array(
  StructField("sepal length", DoubleType, true),
  StructField("sepal width", DoubleType, true),
  StructField("petal length", DoubleType, true),
  StructField("petal width", DoubleType, true),
  StructField("class", StringType, true)))
val rawInput = spark.read.schema(schema).csv("iris.data")


import org.apache.spark.ml.feature.StringIndexer
val stringIndexer = new StringIndexer().
  setInputCol("class").
  setOutputCol("classIndex").
  fit(rawInput)
val labelTransformed = stringIndexer.transform(rawInput).drop("class")

import org.apache.spark.ml.feature.VectorAssembler
val vectorAssembler = new VectorAssembler().
  setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
  setOutputCol("features")
val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val xgbParam = Map("eta" -> 0.1f,
      "missing" -> -999,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 2)
val xgbClassifier = new XGBoostClassifier(xgbParam).
      setFeaturesCol("features").
      setLabelCol("classIndex")

   Message: <console>:65: error: object dmlc is not a member of package org.apache.spark.ml
           import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

pom.xml文件内容:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.cloudera.training.devsh</groupId>
  <artifactId>accounts-by-state</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>"Accounts by State"</name>

  <properties>
    <hadoop.version>3.0.0</hadoop.version>
    <spark.version>2.4.0</spark.version>
    <scala.version>2.11.12</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <java.version>1.8</java.version>
  </properties>
  
  <repositories>
    <repository>
  <id>XGBoost4J Snapshot Repo</id>
  <name>XGBoost4J Snapshot Repo</name>
  <url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/</url>
</repository>
    <repository>
      <id>apache-repo</id>
      <name>Apache Repository</name>
      <url>https://repository.apache.org/content/repositories/releases</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
   <repository>
     <id>cloudera-repo-releases</id>
     <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
   </repository> 
  </repositories>

  <dependencies>
      <dependency> <!-- Scala -->
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
      </dependency>

      <dependency> <!-- Core Spark -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
      </dependency>

      <dependency> <!-- Spark SQL -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
      </dependency>
    
      <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>runtime</scope>
      </dependency>

      <dependency> <!-- Hadoop -->
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-client</artifactId>
         <version>${hadoop.version}</version>
       </dependency>
    
    <dependency>
      <groupId>ml.dmlc</groupId>
      <artifactId>xgboost4j_${scala.binary.version}</artifactId>
      <version>1.1.0-SNAPSHOT</version>
  </dependency>
    
  <dependency>
      <groupId>ml.dmlc</groupId>
      <artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
      <version>1.1.0-SNAPSHOT</version>
  </dependency>

  </dependencies>

  <build>
      <sourceDirectory>src/main/scala</sourceDirectory>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.3</version>
        </plugin>
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>3.2.2</version>
          <executions>
            <execution>
              <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
              </goals>
            </execution>
          </executions>
          <configuration>
            <args>
              <!-- work-around for https://issues.scala-lang.org/browse/SI-8358 -->
              <arg>-nobootcp</arg>
            </args>
          </configuration>
        </plugin>
      </plugins>
    </build>

</project>

编辑:

为了更详细地说明,我在/spark-application/xgboost-project/src/main/scala.xgboost/XgBoost.scala中创建了一个名为XgBoost.scala的文件

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types._
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier._

//{StringType, LongType, StructField, StructType, DoubleType}

object XgBoost {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: XGBOOST TEST")
      System.exit(1)
    }
 
    val stateCode = args(0)
    
    val spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
    
    val rawInput = spark.read.schema(schema).csv("~/iris.data")

    val stringIndexer = new StringIndexer().
      setInputCol("class").
      setOutputCol("classIndex").
      fit(rawInput)
    val labelTransformed = stringIndexer.transform(rawInput).drop("class")

    val vectorAssembler = new VectorAssembler().
      setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
      setOutputCol("features")
    val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")


    val xgbParam = Map("eta" -> 0.1f,
      "missing" -> -999,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 2)
    val xgbClassifier = new XGBoostClassifier(xgbParam).
      setFeaturesCol("features").
      setLabelCol("classIndex")

    spark.stop
  }
}

然后我将pom.xml文件修改如下:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.cloudera.xgboost</groupId>
  <artifactId>xgboost-project</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>"XGBoost Project"</name>

  <properties>
    <hadoop.version>3.0.0</hadoop.version>
    <spark.version>2.4.0</spark.version>
    <scala.version>2.11.12</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <java.version>1.8</java.version>
  </properties>
  
  <repositories>
    <repository>
  <id>XGBoost4J Snapshot Repo</id>
  <name>XGBoost4J Snapshot Repo</name>
  <url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/</url>
</repository>
    <repository>
      <id>apache-repo</id>
      <name>Apache Repository</name>
      <url>https://repository.apache.org/content/repositories/releases</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
   <repository>
     <id>cloudera-repo-releases</id>
     <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
   </repository> 
  </repositories>

  <dependencies>
      <dependency> <!-- Scala -->
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
      </dependency>

      <dependency> <!-- Core Spark -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
      </dependency>

      <dependency> <!-- Spark SQL -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
      </dependency>
    
      <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
      </dependency>

      <dependency> <!-- Hadoop -->
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-client</artifactId>
         <version>${hadoop.version}</version>
       </dependency>
    
    <dependency>
      <groupId>ml.dmlc</groupId>
      <artifactId>xgboost4j_${scala.binary.version}</artifactId>
      <version>1.1.0-SNAPSHOT</version>
  </dependency>
    
  <dependency>
      <groupId>ml.dmlc</groupId>
      <artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
      <version>1.1.0-SNAPSHOT</version>
  </dependency>

  </dependencies>

  <build>
      <sourceDirectory>src/main/scala</sourceDirectory>
      <plugins>
      
        <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>${app.main.class}</mainClass>
                        </manifest>
                    </archive>
                </configuration>
        </plugin>
        
        <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
              <manifest>
                  <mainClass>${app.main.class}</mainClass>
              </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
        </plugin>
        
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>3.2.2</version>
          <executions>
            <execution>
              <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
              </goals>
            </execution>
          </executions>
          <configuration>
            <args>
              <!-- work-around for https://issues.scala-lang.org/browse/SI-8358 -->
              <arg>-nobootcp</arg>
            </args>
          </configuration>
        </plugin>
        

      </plugins>
    </build>

</project> 

当我在xgboost-project目录中提交“ mvn软件包”时,我得到:

[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.cloudera.xgboost:xgboost-project:jar:1.0
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 89, column 17
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] ----------------< com.cloudera.xgboost:xgboost-project >----------------
[INFO] Building "XGBoost Project" 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ xgboost-project ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/exercises/spark-application/xgboost-project/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ xgboost-project ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:compile (default) @ xgboost-project ---
[WARNING]  Expected all dependencies to require Scala version: 2.11.12
[WARNING]  com.cloudera.xgboost:xgboost-project:1.0 requires scala version: 2.11.12
[WARNING]  com.twitter:chill_2.11:0.9.3 requires scala version: 2.11.12
[WARNING]  org.apache.spark:spark-core_2.11:2.4.0 requires scala version: 2.11.12
[WARNING]  org.json4s:json4s-jackson_2.11:3.5.3 requires scala version: 2.11.11
[WARNING] Multiple versions of scala libraries detected!
[INFO] /home/exercises/spark-application/xgboost-project/src/main/scala:-1: info: compiling
[INFO] Compiling 1 source files to /home/cdsw/exercises/spark-application/xgboost-project/target/classes at 1599664489439
[ERROR] /home/exercises/spark-application/xgboost-project/src/main/scala/xgboost/XgBoost.scala:48: error: not found: type XGBoostClassifier
[ERROR]     val xgbClassifier = new XGBoostClassifier(xgbParam).
[ERROR]                             ^
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:08 min
[INFO] Finished at: 2020-09-09T15:14:53Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on project xgboost-project:wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

这可能是我的pom.xml文件有问题吗?现在它还在抱怨多个Scala版本...但是主要是它似乎无法识别XGBoostClassifier对象...

1 个答案:

答案 0 :(得分:1)

提交作业时,您需要提供XGBoost库-最简单的方法是通过--packages的{​​{1}}标志指定Maven坐标,如下所示:

spark-submit

另一种可能性是创建一个具有所有依赖项的胖子-例如,通过使用汇编或分片插件,例如,将以下内容添加到pom.xml的spark-submit --packages ml.dmlc:xgboost4j-spark_2.11:1.1.0-SNAPSHOT other options 部分:

plugins

,然后使用生成的 <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-assembly-plugin</artifactId> <version>3.2.0</version> <configuration> <descriptorRefs> <descriptorRef>jar-with-dependencies</descriptorRef> </descriptorRefs> </configuration> <executions> <execution> <phase>package</phase> <goals> <goal>single</goal> </goals> </execution> </executions> </plugin> 文件执行

更新:这似乎是由于使用...jar-with-dependencies.jar而不是import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier._

而发生的

here是代码的可编译版本。一个问题是-您真的需要使用1.1.0-SNAPSHOT而不是稳定的1.0.0吗?