Spark Cassandra connector Maven build issue

Date: 2018-04-04 09:43:28

Tags: apache-spark datastax spark-cassandra-connector

Hello, I am trying to write a Spark application that reads data from Cassandra. My Scala version is 2.11 and my Spark version is 2.2.0. Unfortunately, I am running into a build problem. It says "missing or invalid dependency detected while loading class file 'package.class'". I have no idea what is causing this.

Here is my POM file:

<properties>
        <maven.compiler.source>1.6</maven.compiler.source>
        <maven.compiler.target>1.6</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <!--scala.tools.version>2.11.8</scala.tools.version-->
        <scala.version>2.11.8</scala.version>
    </properties>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <!-- see http://davidb.github.com/scala-maven-plugin -->
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.3</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <!--arg>-make:transitive</arg-->
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.13</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <!-- If you have classpath issue like NoDefClassError,... -->
                    <!-- useManifestOnlyJar>false</useManifestOnlyJar -->
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>

            <!-- "package" command plugin -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <!-- Scala and Spark dependencies -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-xml</artifactId>
            <version>2.11.0-M4</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang.modules</groupId>
            <artifactId>scala-parser-combinators_2.11</artifactId>
            <version>1.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.11</artifactId>
            <version>2.0.7</version>
        </dependency>
        <!--dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector-java_2.11</artifactId>
            <version>1.5.0-RC1</version>
        </dependency-->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.12</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>2.7.1</version>
        </dependency>
    </dependencies>

I get the following error:

[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ search-count ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] 
[INFO] --- maven-compiler-plugin:2.0.2:compile (default-compile) @ search-count ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- scala-maven-plugin:3.1.3:compile (default) @ search-count ---
[WARNING]  Expected all dependencies to require Scala version: 2.11.8
[WARNING]  search-count:search-count:0.0.1-SNAPSHOT requires scala version: 2.11.8
[WARNING]  org.scala-lang.modules:scala-parser-combinators_2.11:1.0.2 requires scala version: 2.11.1
[WARNING] Multiple versions of scala libraries detected!
[ERROR] error: missing or invalid dependency detected while loading class file 'package.class'.
[INFO] Could not access type DataFrame in value org.apache.spark.sql.package,
[INFO] because it (or its dependencies) are missing. Check your build definition for
[INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
[INFO] A full rebuild may help if 'package.class' was compiled against an incompatible version of org.apache.spark.sql.package.
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.052s
[INFO] Finished at: Wed Apr 04 11:33:51 CEST 2018
[INFO] Final Memory: 22M/425M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.3:compile (default) on project search-count: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]

Any idea what the problem might be?

Console log after running my application:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/04/04 14:15:31 INFO SparkContext: Running Spark version 2.2.0
18/04/04 14:15:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/04 14:15:32 WARN Utils: Your hostname, obel-pc0083 resolves to a loopback address: 127.0.1.1; using 10.96.20.75 instead (on interface eth0)
18/04/04 14:15:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/04/04 14:15:32 INFO SparkContext: Submitted application: Online Gateway Count
18/04/04 14:15:32 INFO Utils: Successfully started service 'sparkDriver' on port 45111.
18/04/04 14:15:32 INFO SparkEnv: Registering MapOutputTracker
18/04/04 14:15:32 INFO SparkEnv: Registering BlockManagerMaster
18/04/04 14:15:32 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/04/04 14:15:32 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/04/04 14:15:32 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e7cfde5b-87f0-4447-a19e-771d100d7422
18/04/04 14:15:32 INFO MemoryStore: MemoryStore started with capacity 1137.6 MB
18/04/04 14:15:32 INFO SparkEnv: Registering OutputCommitCoordinator
18/04/04 14:15:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/04/04 14:15:32 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.96.20.75:4040
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.96.20.75:7077...
18/04/04 14:15:33 INFO TransportClientFactory: Successfully created connection to /10.96.20.75:7077 after 59 ms (0 ms spent in bootstraps)
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20180404141533-0009
18/04/04 14:15:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39062.
18/04/04 14:15:33 INFO NettyBlockTransferService: Server created on 10.96.20.75:39062
18/04/04 14:15:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180404141533-0009/0 on worker-20180403185515-10.96.20.75-38166 (10.96.20.75:38166) with 4 cores
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: Granted executor ID app-20180404141533-0009/0 on hostPort 10.96.20.75:38166 with 4 cores, 1024.0 MB RAM
18/04/04 14:15:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.96.20.75:39062 with 1137.6 MB RAM, BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180404141533-0009/0 is now RUNNING
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/04/04 14:15:34 INFO Native: Could not load JNR C Library, native system calls through this library will not be available (set this logger level to DEBUG to see the full stack trace).
18/04/04 14:15:34 INFO ClockFactory: Using java.lang.System clock to generate timestamps.
18/04/04 14:15:35 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
18/04/04 14:15:36 INFO Cluster: New Cassandra host /10.96.20.75:9042 added
18/04/04 14:15:36 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
18/04/04 14:15:36 INFO SparkContext: Starting job: count at SearchCount.scala:47
18/04/04 14:15:36 INFO DAGScheduler: Registering RDD 4 (distinct at SearchCount.scala:47)
18/04/04 14:15:36 INFO DAGScheduler: Got job 0 (count at SearchCount.scala:47) with 6 output partitions
18/04/04 14:15:36 INFO DAGScheduler: Final stage: ResultStage 1 (count at SearchCount.scala:47)
18/04/04 14:15:36 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/04/04 14:15:36 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/04/04 14:15:36 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[4] at distinct at SearchCount.scala:47), which has no missing parents
18/04/04 14:15:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.6 KB, free 1137.6 MB)
18/04/04 14:15:37 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.2 KB, free 1137.6 MB)
18/04/04 14:15:37 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.96.20.75:39062 (size: 5.2 KB, free: 1137.6 MB)
18/04/04 14:15:37 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
18/04/04 14:15:37 INFO DAGScheduler: Submitting 6 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[4] at distinct at SearchCount.scala:47) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5))
18/04/04 14:15:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
18/04/04 14:15:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.96.20.75:43727) with ID 0
18/04/04 14:15:37 INFO BlockManagerMasterEndpoint: Registering block manager 10.96.20.75:46125 with 366.3 MB RAM, BlockManagerId(0, 10.96.20.75, 46125, None)
18/04/04 14:15:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.96.20.75, executor 0, partition 0, NODE_LOCAL, 12327 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.96.20.75, executor 0, partition 1, NODE_LOCAL, 11729 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.96.20.75, executor 0, partition 2, NODE_LOCAL, 13038 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.96.20.75, executor 0, partition 3, NODE_LOCAL, 12445 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.96.20.75, executor 0, partition 4, NODE_LOCAL, 12209 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.96.20.75, executor 0, partition 5, NODE_LOCAL, 6864 bytes)
18/04/04 14:15:38 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.96.20.75, executor 0): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:309)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

1 Answer:

Answer 0 (score: 1)

EDIT: I really missed that it was commented out...

Specifying the 1.5.0-RC1 dependency should be enough - it already depends on cassandra-spark-connector & spark-core. But if you are using Spark 2.x, you need to use the 2.x version of spark-sql (although it depends on 2.0.2, it should work with 2.2.0).
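For reference, a minimal sketch of what that spark-sql dependency could look like in this POM, assuming the same _2.11 Scala suffix and the 2.2.0 version already used for spark-core above:

        <!-- Provides org.apache.spark.sql (DataFrame, SparkSession, ...) at compile time -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

With spark-core, spark-sql and the connector on matching Scala (2.11) and Spark (2.2.x) lines, the DataFrame type referenced in the "Could not access type DataFrame" error becomes resolvable during compilation.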

I don't know where you got that version of cassandra-spark-connector from - it is very old...
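As a hedged follow-up to that point: in the 2.0.x connector line the Java API is shipped inside the main artifact, so the commented-out spark-cassandra-connector-java_2.11 1.5.0-RC1 entry can simply be deleted, leaving only the coordinate that is already present in the POM above:

        <!-- Single connector artifact for Spark 2.x; no separate -java module needed -->
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.11</artifactId>
            <version>2.0.7</version>
        </dependency>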