I am running Spark standalone locally and use Maven as my build tool, so I have set up all the required dependencies for Spark and json-simple. My Spark application works fine for simple jobs such as word count, but as soon as I import JSONParser from the json-simple API I get a ClassNotFound exception. I have tried adding the jar through SparkConf and through the SparkContext, but that did not help either.
Here is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org</groupId>
    <artifactId>sparketl</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>sparketl</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.googlecode.json-simple</groupId>
            <artifactId>json-simple</artifactId>
            <version>1.1.1</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
My driver class is:
package org.sparketl.etljobs;

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

import scala.Tuple2;

/**
 * @author vijith.reddy
 *
 */
public final class SparkEtl {
    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err
                    .println("Please use: SparkEtl <master> <input file> <output file>");
            System.exit(1);
        }
        @SuppressWarnings("resource")
        JavaSparkContext spark = new JavaSparkContext(args[0],
                "Json ", System.getenv("SPARK_HOME"),
                JavaSparkContext.jarOfClass(SparkEtl.class));
        //SparkConf sc=new SparkConf();
        //sc.setJars(new String[]{"/Users/username/.m2/repository/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1-sources.jar"});
        spark.addJar("/Users/username/.m2/repository/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1-sources.jar");

        JavaRDD<String> file = spark.textFile(args[1]);

        // Split the input into individual lines, one JSON document per line.
        FlatMapFunction<String, String> jsonLine = jsonFile -> {
            return Arrays.asList(jsonFile.toLowerCase().split("\\r?\\n"));
        };
        JavaRDD<String> eachLine = file.flatMap(jsonLine);

        // Parse each line as JSON and pair it with its "country" field.
        PairFunction<String, String, String> mapCountry = eachItem -> {
            JSONParser parser = new JSONParser();
            String country = "";
            try {
                Object obj = parser.parse(eachItem);
                JSONObject jsonObj = (JSONObject) obj;
                country = (String) jsonObj.get("country");
            } catch (Exception e) {
                e.printStackTrace();
            }
            return new Tuple2<String, String>(eachItem, country);
        };
        JavaPairRDD<String, String> pairs = eachLine.mapToPair(mapCountry);
        pairs.sortByKey(true).saveAsTextFile(args[2]);
        System.exit(0);
    }
}
I get the following error in my logs:
15/07/08 16:09:17 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/07/08 16:09:17 INFO SparkContext: Added JAR /Users/username/.m2/repository/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1-sources.jar at http://172.16.8.157:52255/jars/json-simple-1.1.1-sources.jar with timestamp 1436396957111
15/07/08 16:09:17 INFO MemoryStore: ensureFreeSpace(110248) called with curMem=0, maxMem=278019440
15/07/08 16:09:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 265.0 MB)
15/07/08 16:09:17 INFO MemoryStore: ensureFreeSpace(10090) called with curMem=110248, maxMem=278019440
15/07/08 16:09:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.9 KB, free 265.0 MB)
15/07/08 16:09:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.8.157:52257 (size: 9.9 KB, free: 265.1 MB)
15/07/08 16:09:17 INFO SparkContext: Created broadcast 0 from textFile at SparkEtl.java:35
15/07/08 16:09:17 INFO FileInputFormat: Total input paths to process : 1
15/07/08 16:09:17 INFO SparkContext: Starting job: sortByKey at SparkEtl.java:58
15/07/08 16:09:17 INFO DAGScheduler: Got job 0 (sortByKey at SparkEtl.java:58) with 2 output partitions (allowLocal=false)
15/07/08 16:09:17 INFO DAGScheduler: Final stage: ResultStage 0(sortByKey at SparkEtl.java:58)
15/07/08 16:09:17 INFO DAGScheduler: Parents of final stage: List()
15/07/08 16:09:17 INFO DAGScheduler: Missing parents: List()
15/07/08 16:09:17 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[5] at sortByKey at SparkEtl.java:58), which has no missing parents
15/07/08 16:09:17 INFO MemoryStore: ensureFreeSpace(5248) called with curMem=120338, maxMem=278019440
15/07/08 16:09:17 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.1 KB, free 265.0 MB)
15/07/08 16:09:17 INFO MemoryStore: ensureFreeSpace(2888) called with curMem=125586, maxMem=278019440
15/07/08 16:09:17 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.8 KB, free 265.0 MB)
15/07/08 16:09:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.16.8.157:52257 (size: 2.8 KB, free: 265.1 MB)
15/07/08 16:09:17 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/07/08 16:09:17 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[5] at sortByKey at SparkEtl.java:58)
15/07/08 16:09:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/07/08 16:09:18 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@172.16.8.157:52260/user/Executor#2100827222]) with ID 0
15/07/08 16:09:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.16.8.157, PROCESS_LOCAL, 1560 bytes)
15/07/08 16:09:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 172.16.8.157, PROCESS_LOCAL, 1560 bytes)
15/07/08 16:09:18 INFO BlockManagerMasterEndpoint: Registering block manager 172.16.8.157:52263 with 265.1 MB RAM, BlockManagerId(0, 172.16.8.157, 52263)
15/07/08 16:09:18 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.16.8.157:52263 (size: 2.8 KB, free: 265.1 MB)
15/07/08 16:09:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.8.157:52263 (size: 9.9 KB, free: 265.1 MB)
15/07/08 16:09:19 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 172.16.8.157): java.lang.NoClassDefFoundError: org/json/simple/parser/JSONParser
at org.sparketl.etljobs.SparkEtl.lambda$main$b9f570ea$1(SparkEtl.java:44)
at org.sparketl.etljobs.SparkEtl$$Lambda$11/1498038525.call(Unknown Source)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1030)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1030)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:42)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/07/08 16:09:19 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor 172.16.8.157: java.lang.NoClassDefFoundError (org/json/simple/parser/JSONParser) [duplicate 1]
My Spark configuration:
spark.executor.memory 512m
spark.driver.cores 1
spark.driver.memory 512m
spark.driver.extraClassPath /Users/username/.m2/repository/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1-sources.jar
Has anyone run into this problem? If so, what is the solution?
Answer (score: 3):
Judging from spark.driver.extraClassPath (and from the code), the library you are handing to Spark is a sources jar (json-simple-1.1.1-sources.jar). That jar most likely contains only .java files (the sources, not compiled Java classes), which is why the executors cannot resolve org/json/simple/parser/JSONParser at runtime.

Changing it to json-simple-1.1.1.jar (the full path, of course) should help.
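For illustration, here is a minimal sketch of the corrected setup. It assumes the compiled jar sits next to the sources jar in the standard local Maven repository layout; the exact path may differ on your machine.

In the Spark configuration (spark-defaults.conf):

spark.driver.extraClassPath /Users/username/.m2/repository/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar

And in the driver class:

// Ship the binary jar (which contains org/json/simple/parser/JSONParser.class),
// not the -sources jar, to the executors.
spark.addJar("/Users/username/.m2/repository/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar");

You can confirm the difference by listing the jar contents with jar tf: the -sources jar shows only .java files, while json-simple-1.1.1.jar shows the compiled .class files the executors need.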