I am submitting a Spring Boot application to a Spark 2 cluster; it reads some data from Hive, transforms it with a FlatMapFunction, and writes it back to Hive. After the Hive data has been loaded, it fails with a ClassNotFoundException on the FlatMapFunction.
Without Spring Boot the code works fine. When I convert the project to Spring Boot using the 'spring-boot-maven-plugin', creating the Spark session in an @Configuration class and injecting it into the main @SpringBootApplication class, the error appears after the data has been loaded from Hive.
I am not using Spring to wire the Spark processing code together; that is done manually. I have also tried an anonymous implementation of the function, roughly as sketched below.
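A minimal sketch of the anonymous variant I tried (reconstructed here for illustration, using the same dataFrame as in the Analyser class below; it fails with the same ClassNotFoundException):

final JavaRDD<Row> matchResults = dataFrame.javaRDD().mapPartitions(
        (FlatMapFunction<Iterator<Row>, Row>) inputs -> {
            // Same per-partition transformation as the named Transformer class below.
            final List<Row> outputs = new ArrayList<>();
            while ( inputs.hasNext( ) ) {
                final Row input = inputs.next( );
                outputs.add( RowFactory.create( "qqqq=>" + input.get( 0 ), "qqqq=>" + input.get( 1 ) ) );
            }
            return outputs.iterator();
        } );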
Command line
spark-submit --master yarn --deploy-mode client --driver-memory 20g --num-executors 5 --executor-cores 4 --executor-memory 12g application.jar
From pom.xml:
<build>
    <plugins>
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <configuration>
                <fork>true</fork>
                <executable>true</executable>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>repackage</goal>
                    </goals>
                    <configuration>
                        <mainClass>com.ced.spark.patterndetection.Driver</mainClass>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
My classes in the application jar file:
2465 Wed Sep 11 08:46:14 NZST 2019 BOOT-INF/classes/com/ced/spark/patterndetection/Analyser.class
1667 Wed Sep 11 08:46:14 NZST 2019 BOOT-INF/classes/com/ced/spark/patterndetection/Driver.class
1638 Wed Sep 11 08:46:14 NZST 2019 BOOT-INF/classes/com/ced/spark/patterndetection/SparkSessionConfiguration.class
2242 Wed Sep 11 08:46:14 NZST 2019 BOOT-INF/classes/com/ced/spark/patterndetection/Transformer.class
Application classes
import java.io.Serializable;

import javax.inject.Inject;

import org.apache.spark.sql.SparkSession;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication( exclude = { org.springframework.boot.autoconfigure.gson.GsonAutoConfiguration.class } )
public class Driver implements CommandLineRunner, Serializable {

    private final SparkSession spark;

    @Inject
    public Driver( SparkSession spark ) {
        this.spark = spark;
    }

    public static void main( String[] args ) {
        SpringApplication.run( Driver.class, args );
    }

    @Override
    public void run( String... args ) {
        new Analyser( this.spark, new Transformer( ) ).doit();
    }
}
import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Analyser implements Serializable {

    private final SparkSession spark;
    private final Transformer transformer;

    Analyser( SparkSession spark, Transformer transformer ) {
        this.spark = spark;
        this.transformer = transformer;
    }

    void doit() {
        spark.sql( "USE ced_campaign_analysis" );
        final Dataset<Row> dataFrame = spark.sql( "SELECT * FROM src" );
        dataFrame.show();

        final JavaRDD<Row> matchResults = dataFrame.javaRDD().mapPartitions( this.transformer );
        matchResults.collect().stream().forEach( System.out::println );
    }
}
The function implementation in question
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class Transformer implements FlatMapFunction<Iterator<Row>, Row>, Serializable {

    @Override
    public Iterator<Row> call( Iterator<Row> inputs ) {
        final List<Row> outputs = new ArrayList<>();
        while ( inputs.hasNext( ) ) {
            final Row input = inputs.next( );
            final List<String> output = new ArrayList<>();
            output.add( "qqqq=>" + input.get( 0 ) );
            output.add( "qqqq=>" + input.get( 1 ) );
            outputs.add( RowFactory.create( output.toArray( ) ) );
        }
        return outputs.iterator();
    }
}
Answer 0 (score: 0)
Revisiting this problem, I found a better way to run a Spring Boot application on Apache Spark: see https://radanalytics.io/assets/my-first-radanalytics-app/sparkpi-java-spring.html.
The key to the solution is to hand Spark the original jar file produced by the spring-boot-maven-plugin, rather than the repackaged Spring Boot jar. In the repackaged jar the classes are nested under BOOT-INF/classes/ (as in the listing above), so the Spark executors cannot load them:
final SparkSession spark = SparkSession
        .builder( )
        .config( "spark.jars", config.originalJarFile( ) )
        ...
        .getOrCreate( );
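Under this approach the application is presumably started as a plain Spring Boot executable rather than through spark-submit, with spark.jars shipping the original, non-repackaged jar to the executors; something along these lines (an assumed invocation, not from the original answer):

java -jar target/application.jar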
The name of this jar file is set by the plugin below: the repackage goal leaves the pre-repackage artifact behind as *.jar.original, and this plugin renames it to a regular .jar so it can be referenced:
<plugin>
    <groupId>com.coderplus.maven.plugins</groupId>
    <artifactId>copy-rename-maven-plugin</artifactId>
    <version>1.0.1</version>
    <executions>
        <execution>
            <id>rename-file</id>
            <phase>package</phase>
            <goals>
                <goal>rename</goal>
            </goals>
            <configuration>
                <sourceFile>target/${project.name}-${project.version}.jar.original</sourceFile>
                <destinationFile>target/${project.name}-${project.version}-original.jar</destinationFile>
            </configuration>
        </execution>
    </executions>
</plugin>
The name of the jar file is captured in this property, which makes use of Maven build parameters (the @...@ placeholders are resolved at build time by Maven resource filtering, as configured by the Spring Boot parent POM):
jar-file: /home/xxxxxx/jars/@project.name@-@project.version@-original.jar
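For completeness, a minimal sketch of how such a property could back the config.originalJarFile() call shown earlier (the class name and property binding here are assumptions, not part of the original answer):

import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

// Hypothetical holder for the original-jar path; binds the 'jar-file' property above.
@Configuration
public class SparkJarConfig {

    @Value( "${jar-file}" )
    private String originalJarFile;

    public String originalJarFile( ) {
        return this.originalJarFile;
    }
}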