Hello, I am trying to write a Spark application that reads data from Cassandra. My Scala version is 2.11 and my Spark version is 2.2.0. Unfortunately I am running into a build problem: it says "missing or invalid dependency detected while loading class file 'package.class'". I don't know what is causing this.
Here is my POM file:
<properties>
  <maven.compiler.source>1.6</maven.compiler.source>
  <maven.compiler.target>1.6</maven.compiler.target>
  <encoding>UTF-8</encoding>
  <!--scala.tools.version>2.11.8</scala.tools.version-->
  <scala.version>2.11.8</scala.version>
</properties>
<build>
  <sourceDirectory>src/main/scala</sourceDirectory>
  <testSourceDirectory>src/test/scala</testSourceDirectory>
  <plugins>
    <plugin>
      <!-- see http://davidb.github.com/scala-maven-plugin -->
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.1.3</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
          <configuration>
            <args>
              <!--arg>-make:transitive</arg-->
              <arg>-dependencyfile</arg>
              <arg>${project.build.directory}/.scala_dependencies</arg>
            </args>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>2.13</version>
      <configuration>
        <useFile>false</useFile>
        <disableXmlReport>true</disableXmlReport>
        <!-- If you have classpath issue like NoDefClassError,... -->
        <!-- useManifestOnlyJar>false</useManifestOnlyJar -->
        <includes>
          <include>**/*Test.*</include>
          <include>**/*Suite.*</include>
        </includes>
      </configuration>
    </plugin>
    <!-- "package" command plugin -->
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.4.1</version>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
<dependencies>
  <!-- Scala and Spark dependencies -->
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-xml</artifactId>
    <version>2.11.0-M4</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang.modules</groupId>
    <artifactId>scala-parser-combinators_2.11</artifactId>
    <version>1.0.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
  </dependency>
  <dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.11</artifactId>
    <version>2.0.7</version>
  </dependency>
  <!--dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.11</artifactId>
    <version>1.5.0-RC1</version>
  </dependency-->
  <dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.12</version>
  </dependency>
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.7.1</version>
  </dependency>
</dependencies>
I am getting the following error:
[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ search-count ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO]
[INFO] --- maven-compiler-plugin:2.0.2:compile (default-compile) @ search-count ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- scala-maven-plugin:3.1.3:compile (default) @ search-count ---
[WARNING] Expected all dependencies to require Scala version: 2.11.8
[WARNING] search-count:search-count:0.0.1-SNAPSHOT requires scala version: 2.11.8
[WARNING] org.scala-lang.modules:scala-parser-combinators_2.11:1.0.2 requires scala version: 2.11.1
[WARNING] Multiple versions of scala libraries detected!
[ERROR] error: missing or invalid dependency detected while loading class file 'package.class'.
[INFO] Could not access type DataFrame in value org.apache.spark.sql.package,
[INFO] because it (or its dependencies) are missing. Check your build definition for
[INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
[INFO] A full rebuild may help if 'package.class' was compiled against an incompatible version of org.apache.spark.sql.package.
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.052s
[INFO] Finished at: Wed Apr 04 11:33:51 CEST 2018
[INFO] Final Memory: 22M/425M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.3:compile (default) on project search-count: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]
Any idea what the problem could be?
Console log after running my application:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/04/04 14:15:31 INFO SparkContext: Running Spark version 2.2.0
18/04/04 14:15:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/04 14:15:32 WARN Utils: Your hostname, obel-pc0083 resolves to a loopback address: 127.0.1.1; using 10.96.20.75 instead (on interface eth0)
18/04/04 14:15:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/04/04 14:15:32 INFO SparkContext: Submitted application: Online Gateway Count
18/04/04 14:15:32 INFO Utils: Successfully started service 'sparkDriver' on port 45111.
18/04/04 14:15:32 INFO SparkEnv: Registering MapOutputTracker
18/04/04 14:15:32 INFO SparkEnv: Registering BlockManagerMaster
18/04/04 14:15:32 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/04/04 14:15:32 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/04/04 14:15:32 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e7cfde5b-87f0-4447-a19e-771d100d7422
18/04/04 14:15:32 INFO MemoryStore: MemoryStore started with capacity 1137.6 MB
18/04/04 14:15:32 INFO SparkEnv: Registering OutputCommitCoordinator
18/04/04 14:15:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/04/04 14:15:32 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.96.20.75:4040
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.96.20.75:7077...
18/04/04 14:15:33 INFO TransportClientFactory: Successfully created connection to /10.96.20.75:7077 after 59 ms (0 ms spent in bootstraps)
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20180404141533-0009
18/04/04 14:15:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39062.
18/04/04 14:15:33 INFO NettyBlockTransferService: Server created on 10.96.20.75:39062
18/04/04 14:15:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180404141533-0009/0 on worker-20180403185515-10.96.20.75-38166 (10.96.20.75:38166) with 4 cores
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: Granted executor ID app-20180404141533-0009/0 on hostPort 10.96.20.75:38166 with 4 cores, 1024.0 MB RAM
18/04/04 14:15:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.96.20.75:39062 with 1137.6 MB RAM, BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180404141533-0009/0 is now RUNNING
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/04/04 14:15:34 INFO Native: Could not load JNR C Library, native system calls through this library will not be available (set this logger level to DEBUG to see the full stack trace).
18/04/04 14:15:34 INFO ClockFactory: Using java.lang.System clock to generate timestamps.
18/04/04 14:15:35 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
18/04/04 14:15:36 INFO Cluster: New Cassandra host /10.96.20.75:9042 added
18/04/04 14:15:36 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
18/04/04 14:15:36 INFO SparkContext: Starting job: count at SearchCount.scala:47
18/04/04 14:15:36 INFO DAGScheduler: Registering RDD 4 (distinct at SearchCount.scala:47)
18/04/04 14:15:36 INFO DAGScheduler: Got job 0 (count at SearchCount.scala:47) with 6 output partitions
18/04/04 14:15:36 INFO DAGScheduler: Final stage: ResultStage 1 (count at SearchCount.scala:47)
18/04/04 14:15:36 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/04/04 14:15:36 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/04/04 14:15:36 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[4] at distinct at SearchCount.scala:47), which has no missing parents
18/04/04 14:15:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.6 KB, free 1137.6 MB)
18/04/04 14:15:37 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.2 KB, free 1137.6 MB)
18/04/04 14:15:37 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.96.20.75:39062 (size: 5.2 KB, free: 1137.6 MB)
18/04/04 14:15:37 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
18/04/04 14:15:37 INFO DAGScheduler: Submitting 6 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[4] at distinct at SearchCount.scala:47) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5))
18/04/04 14:15:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
18/04/04 14:15:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.96.20.75:43727) with ID 0
18/04/04 14:15:37 INFO BlockManagerMasterEndpoint: Registering block manager 10.96.20.75:46125 with 366.3 MB RAM, BlockManagerId(0, 10.96.20.75, 46125, None)
18/04/04 14:15:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.96.20.75, executor 0, partition 0, NODE_LOCAL, 12327 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.96.20.75, executor 0, partition 1, NODE_LOCAL, 11729 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.96.20.75, executor 0, partition 2, NODE_LOCAL, 13038 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.96.20.75, executor 0, partition 3, NODE_LOCAL, 12445 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.96.20.75, executor 0, partition 4, NODE_LOCAL, 12209 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.96.20.75, executor 0, partition 5, NODE_LOCAL, 6864 bytes)
18/04/04 14:15:38 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.96.20.75, executor 0): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:309)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Answer (score: 1)
EDIT: I really missed that the spark-cassandra-connector-java dependency is commented out...
Specifying the 1.5.0-RC1 dependency should be enough - it already depends on cassandra-spark-connector & spark-core. But if you use Spark 2.x, you need to use the 2.x version of spark-sql (although it depends on 2.0.2, it can work with 2.2.0).

I don't know where you took that version of cassandra-spark-connector from - it is quite old...
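To make that concrete, here is a minimal sketch of how the Spark 2.2 dependency section could look. It is an illustration rather than the exact fix from the answer: the provided scope is an assumption (it keeps Spark itself out of the assembly jar, since the cluster already ships it), and the versions simply mirror the ones mentioned above. Adding spark-sql_2.11 is what should resolve the "Could not access type DataFrame" compile error, because the DataFrame alias lives in the spark-sql module:

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.11</artifactId>
    <version>2.0.7</version>
    <!-- left unscoped so the assembly plugin packs the connector into the fat jar;
         that should also address the ClassNotFoundException for CassandraPartition
         seen on the executors, assuming the jar-with-dependencies artifact is the
         one handed to spark-submit -->
  </dependency>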