I am new to the Spark world and need help with this (hopefully trivial) problem.
I have a local Spark setup with one master and one worker running on the same machine.
I am creating my Java application with Spring Boot and Gradle, and it submits jobs to the Spark instance.
I have a service class that receives a JavaSparkContext:
public void loadTransactions(JavaSparkContext context) {
    try {
        // Collect all records from the Spring Data JPA repository into a plain list
        List<TransactionRecord> transactionRecordList = new ArrayList<>();
        Iterable<TransactionRecord> all = trxRecordRepository.findAll();
        all.forEach(trx -> transactionRecordList.add(trx));
        System.out.println("Trx array list ready: " + transactionRecordList.size());

        // Distribute the list over 4 partitions and trigger a job with count()
        JavaRDD<TransactionRecord> trxRecordRDD = context.parallelize(transactionRecordList, 4);
        System.out.println(trxRecordRDD.count());
        System.out.println("data frame loaded");
    } catch (Exception e) {
        logger.error("Error while loading transactions", e.getCause());
    } finally {
        context.close();
    }
}
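For reference, TransactionRecord itself is a plain JPA entity and has to implement Serializable for parallelize() to ship it to the executors; the sketch below is illustrative only (the real class has different fields):

import java.io.Serializable;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class TransactionRecord implements Serializable {

    @Id
    @GeneratedValue
    private Long id;

    // illustrative column only; the actual entity has more fields
    private double amount;

    public Long getId() { return id; }
    public double getAmount() { return amount; }
    public void setAmount(double amount) { this.amount = amount; }
}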
When I execute this method, the transactionRecordRepository (Spring Data JPA) successfully populates the list.
The Spark job starts executing, but then fails with the following error:
2017-03-08 10:28:44.888 WARN 9021 --- [result-getter-2] o.apache.spark.scheduler.TaskSetManager : Lost task 1.0 in stage 0.0 (TID 1, 10.20.12.216, executor 0): java.io.IOException: java.lang.ClassNotFoundException: learning.spark.models.TransactionRecord
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2122)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:258)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: learning.spark.models.TransactionRecord
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1819)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1919)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:552)
at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:74)
at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:70)
at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:70)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
... 20 more
If I load simpler data from a text file, everything works fine:
JavaRDD<String> movieData = context.textFile("/Users/panshul/Development/sparkDataDump/ratings.csv", 4);
count = movieData.count();
My Gradle build file:
buildscript {
    ext {
        springBootVersion = '1.5.2.RELEASE'
    }
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath("org.springframework.boot:spring-boot-gradle-plugin:${springBootVersion}")
    }
}

apply plugin: 'java'
apply plugin: 'eclipse'
apply plugin: 'idea'
apply plugin: 'org.springframework.boot'

jar {
    baseName = 'spark-example'
    version = '0.0.1-SNAPSHOT'
}

sourceCompatibility = 1.8
targetCompatibility = 1.8

repositories {
    mavenCentral()
    mavenLocal()
}

dependencies {
    compile('org.springframework.boot:spring-boot-starter-web') {
        exclude module: "spring-boot-starter-tomcat"
    }
    compile("org.springframework.boot:spring-boot-starter-jetty")
    compile("org.springframework.boot:spring-boot-starter-actuator")
    compile("org.springframework.boot:spring-boot-starter-data-jpa")
    compile("mysql:mysql-connector-java:6.0.5")
    compile("org.codehaus.janino:janino:3.0.6")
    compile("org.apache.spark:spark-core_2.11:2.1.0") {
        exclude group: "org.slf4j", module: "slf4j-log4j12"
    }
    compile("org.apache.spark:spark-sql_2.11:2.1.0") {
        exclude group: "org.slf4j", module: "slf4j-log4j12"
    }
    testCompile("org.springframework.boot:spring-boot-starter-test")
    testCompile("junit:junit")
}
Please help me figure out what I am doing wrong here.
I am using Spark version 2.1.0, downloaded and installed from the Spark website, running on macOS Sierra.
Answer 0 (score: 1)
I think your problem is that when you submit the job, your class learning.spark.models.TransactionRecord is not included on the classpath.
You either have to specify all dependent jars with the spark-submit --jars argument, or build one fat jar that contains all the dependencies.
I think the easiest way is to submit the individual jars like this:
$SPARK_HOME/bin/spark-submit --name yourApp --class yourMain.class --master yourMaster --jars dependencyA.jar,dependencyB.jar job.jar
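If the context is created inside the Spring Boot application itself rather than through spark-submit, the same effect can be achieved programmatically with SparkConf.setJars, which tells Spark to ship the listed jars to the executors. A minimal sketch, assuming the application jar built by Gradle contains TransactionRecord (the class name, master URL and jar path below are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkContextFactory {

    public static JavaSparkContext create() {
        SparkConf conf = new SparkConf()
                .setAppName("spark-example")
                // assumed master URL of the local standalone cluster
                .setMaster("spark://localhost:7077")
                // ship the application jar (which contains TransactionRecord) to the executors
                .setJars(new String[] {"build/libs/spark-example-0.0.1-SNAPSHOT.jar"});
        return new JavaSparkContext(conf);
    }
}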
Answer 1 (score: 0)
I had to create a jar of all the custom classes I was using and place it in the jars folder of my Apache Spark installation.
That made the Spark master pick up my custom RDD type classes and propagate them to the workers.
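A runtime alternative that avoids copying anything into the Spark installation is JavaSparkContext.addJar, which distributes a jar to the executors of the current application only; a sketch (the path is hypothetical):

// Make the jar containing the custom model classes available to the executors
// of this application (path is hypothetical).
context.addJar("/Users/panshul/Development/spark-example/build/libs/spark-example-0.0.1-SNAPSHOT.jar");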