Error when running a Spark application on a remote Spark cluster from the IDE

Asked: 2017-12-14 09:40:20

Tags: maven apache-spark intellij-idea

I have a Spark application that reads data from Kafka and processes it. I build a fat jar with Maven using the command mvn clean compile assembly:single, and I can submit it successfully to the remote Spark cluster with the spark-submit command-line tool (no YARN, just a standalone cluster). Now I am trying to run the same application directly from the IntelliJ IDE, without building the fat jar first. After starting the application from the IDE, it submits a job to the cluster master, but after a while it fails with the following error:

java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka010.KafkaRDDPartition

I think the Spark application cannot access the dependencies declared in the POM.xml file.
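For reference, the submission that works from the terminal looks roughly like this (a sketch: the jar name is what the assembly configuration below produces, the flags mirror the settings in my code further down, and the application arguments are placeholders):

spark-submit \
    --class SparkTest \
    --master spark://namenode1:7077 \
    --deploy-mode client \
    --executor-memory 700m \
    target/SparkPMUProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar \
    <application arguments>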

Here is the POM.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>Saprk</groupId>
    <artifactId>SparkPMUProcessing</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>SparkTest</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>0.10.0.0</version>
        </dependency>
    </dependencies>
</project>

Important point: I run into the same problem with an Apache Flink application on a remote cluster. Both Flink and Spark run correctly when I build a fat jar and submit it to the cluster from the terminal.

Update: using the setJars method I added the dependency jar files, and the java.lang.ClassNotFoundException error went away. Now it says:

java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

Here is my code:

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class SparkTest {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("PMUStreaming").setMaster("spark://namenode1:7077")
                .set("spark.deploy.mode", "client")
                .set("spark.executor.memory", "700m").setJars(new String[]{
                        "/home/xxx/SparkRunningJars/kafka_2.11-0.10.0.0.jar",
                        "/home/xxx/SparkRunningJars/kafka-clients-0.10.0.0.jar",
                        "/home/xxx/SparkRunningJars/spark-streaming-kafka-0-10_2.11-2.2.0.jar"
                });
        Map<String, Object> kafkaParams = new HashMap<>();

        Collection<String> TOPIC = Arrays.asList(args[6]);
        final String BOOTSTRAPSERVERS = args[0];
        final String ZOOKEEPERSERVERS = args[1];
        final String ID = args[2];
        final int BATCH_SIZE = Integer.parseInt(args[3]);
        final String PATH = args[4];
        final String READMETHOD = args[5];

        kafkaParams.put("bootstrap.servers", BOOTSTRAPSERVERS);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", ID);
        kafkaParams.put("auto.offset.reset", READMETHOD);
        kafkaParams.put("enable.auto.commit", false);
        kafkaParams.put("metadata.max.age.ms", 30000);

        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(BATCH_SIZE));
        JavaInputDStream<ConsumerRecord<String, byte[]>> stream = KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(TOPIC, kafkaParams)
        );


        // Note: getTime(...) and finall(...) are helper methods defined elsewhere (not shown in this snippet).
        stream.map(record -> getTime(record.value()) + ":"
                + Long.toString(System.currentTimeMillis()) + ":"
                + Arrays.deepToString(finall(record.value()))
                + ":" + Long.toString(System.currentTimeMillis()))
                .map(record -> record + ":"
                        + Long.toString(Long.parseLong(record.split(":")[3]) - Long.parseLong(record.split(":")[1])))
                .repartition(1)
                .foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public void call(JavaRDD<String> rdd, Time time) throws Exception {
                        if (rdd.count() > 0) {
                            rdd.saveAsTextFile(PATH + "/" + time.milliseconds());
                        }
                    }
                });
        ssc.start();
        ssc.awaitTermination();
    }
}

2 Answers:

Answer 0 (score: 1)

Have you seen this answer? It might help:

java.lang.ClassCastException using lambda expressions in spark job on remote server


If you run the code from your IDE, just call setJars(new String[]{"/path/to/jar/with/your/class.jar"}) on your SparkConf instance. spark-submit distributes your jar by default, so there is no such problem there.

Update: You also have to add your project's own jar.
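To build that project jar (the thin jar, without bundled dependencies), a plain Maven package of the question's POM should be enough, roughly:

# should produce target/SparkPMUProcessing-1.0-SNAPSHOT.jar, the path referenced in setJars below
mvn clean package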

So the code should be:

public class SparkTest {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("PMUStreaming").setMaster("spark://namenode1:7077")
                .set("spark.deploy.mode", "client")
                .set("spark.executor.memory", "700m").setJars(new String[]{
                        "/home/xxx/SparkRunningJars/kafka_2.11-0.10.0.0.jar",
                        "/home/xxx/SparkRunningJars/kafka-clients-0.10.0.0.jar",
                        "/home/xxx/SparkRunningJars/spark-streaming-kafka-0-10_2.11-2.2.0.jar",
                        "/path/to/your/project/target/SparkPMUProcessing-1.0-SNAPSHOT.jar"
                });
        Map<String, Object> kafkaParams = new HashMap<>();

        Collection<String> TOPIC = Arrays.asList(args[6]);
        final String BOOTSTRAPSERVERS = args[0];
        final String ZOOKEEPERSERVERS = args[1];
        final String ID = args[2];
        final int BATCH_SIZE = Integer.parseInt(args[3]);
        final String PATH = args[4];
        final String READMETHOD = args[5];

        kafkaParams.put("bootstrap.servers", BOOTSTRAPSERVERS);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", ID);
        kafkaParams.put("auto.offset.reset", READMETHOD);
        kafkaParams.put("enable.auto.commit", false);
        kafkaParams.put("metadata.max.age.ms", 30000);

        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(BATCH_SIZE));
        JavaInputDStream<ConsumerRecord<String, byte[]>> stream = KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(TOPIC, kafkaParams)
        );


        stream.map(record -> getTime(record.value()) + ":"
                + Long.toString(System.currentTimeMillis()) + ":"
                + Arrays.deepToString(finall(record.value()))
                + ":" + Long.toString(System.currentTimeMillis()))
                .map(record -> record + ":"
                        + Long.toString(Long.parseLong(record.split(":")[3]) - Long.parseLong(record.split(":")[1])))
                .repartition(1)
                .foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public void call(JavaRDD<String> rdd, Time time) throws Exception {
                        if (rdd.count() > 0) {
                            rdd.saveAsTextFile(PATH + "/" + time.milliseconds());
                        }
                    }
                });
        ssc.start();
        ssc.awaitTermination();
    }
}

Answer 1 (score: 0)

These are my build.sbt dependencies. It is an sbt configuration, but you can see from it what you need in order to specify the dependencies.

lazy val commonLibraryDependencies = Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% f"spark-streaming-kafka-$kafkaVersion" % sparkVersion,
  "org.apache.spark" %% f"spark-sql-kafka-$kafkaVersion" % sparkVersion,
)
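
If you stay with Maven instead, a rough equivalent of the provided scope above (a sketch, reusing the versions already declared in the question's POM) is:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>

Leave spark-streaming-kafka-0-10 at the default compile scope, since the Kafka integration is not shipped with the Spark distribution and has to end up in your fat jar.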