First, I try to deploy a Spark Java application on a YARN cluster with the following command:
spark-submit --master yarn --class com.batchjob.BatchJob D:\batchjob-0.0.1-SNAPSHOT-shaded.jar
My Java code:
package com.batchjob;

import java.io.IOException;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class BatchJob {
    public static void main(String[] args) throws IOException {
        // get Spark configuration
        SparkConf sparkConf = new SparkConf().setAppName("Example Spark App");//.setMaster("local");
        // set up a Spark session to be able to work with Datasets
        SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
        // import data
        Dataset<Row> input = spark.read().csv("hdfs://localhost:9000/input_dir/data.csv");
        input.show();
        // map to Dataset of Activity
        Dataset<Activity> activityDataset = input.map((row) -> {
            if (row.size() != 8)
                throw new RuntimeException("Row must have size of 8!");
            return new Activity(Long.parseLong(row.getString(0)), row.getString(1), row.getString(2), row.getString(3),
                    row.getString(4), row.getString(5), row.getString(6), row.getString(7));
        }, Encoders.bean(Activity.class));
        /*
         * Actions & Transformations
         */
        activityDataset.createOrReplaceTempView("activity");
        Dataset<Row> sqlResult = spark.sql("SELECT " + "product, timestamp, referrer, "
                + "SUM( CASE WHEN action = 'page_view' THEN 1 ELSE 0 END) AS page_view_count, "
                + "SUM( CASE WHEN action = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_cart_count, "
                + "SUM( CASE WHEN action = 'purchase' THEN 1 ELSE 0 END) AS purchase_count " + "FROM activity "
                + "GROUP BY product, timestamp, referrer").cache();
        sqlResult.write().partitionBy("referrer").mode(SaveMode.Append).parquet("hdfs://localhost:9000/lambda/batch1");
        spark.close();
    }
}
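For reference, Encoders.bean(Activity.class) expects Activity to be a serializable JavaBean (public no-arg constructor plus getters/setters), and the SQL query above needs bean properties named product, timestamp, referrer and action. The class itself is not shown here, so the following is only a minimal sketch; apart from those four names, the field names and the constructor argument order are placeholders:

// Minimal sketch of the Activity bean; the real field names/order may differ.
public class Activity implements java.io.Serializable {
    private long timestamp;       // parsed from column 0 in BatchJob
    private String referrer;
    private String action;
    private String prevPage;      // placeholder name
    private String page;          // placeholder name
    private String visitor;       // placeholder name
    private String product;
    private String category;      // placeholder name

    public Activity() {           // no-arg constructor required by Encoders.bean
    }

    public Activity(long timestamp, String referrer, String action, String prevPage,
                    String page, String visitor, String product, String category) {
        this.timestamp = timestamp;
        this.referrer = referrer;
        this.action = action;
        this.prevPage = prevPage;
        this.page = page;
        this.visitor = visitor;
        this.product = product;
        this.category = category;
    }

    // getters/setters for the columns used in the SQL query; the rest are analogous
    public long getTimestamp() { return timestamp; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
    public String getReferrer() { return referrer; }
    public void setReferrer(String referrer) { this.referrer = referrer; }
    public String getAction() { return action; }
    public void setAction(String action) { this.action = action; }
    public String getProduct() { return product; }
    public void setProduct(String product) { this.product = product; }
}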
And here is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com</groupId>
    <artifactId>batchjob</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>batchjob</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.3.1</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <shadedArtifactAttached>true</shadedArtifactAttached>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <artifactSet>
                                <includes>
                                    <include>*:*</include>
                                </includes>
                            </artifactSet>
                            <transformers>
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
        <resources>
            <resource>
                <directory>.</directory>
                <includes>
                    <include>src/main/resources/*.*</include>
                </includes>
            </resource>
        </resources>
    </build>
</project>
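For completeness, the -shaded.jar passed to spark-submit above is the extra artifact attached by the maven-shade-plugin (shadedArtifactAttached is true) and is built with:

mvn clean package

which produces batchjob-0.0.1-SNAPSHOT-shaded.jar under target\, the jar referenced in the spark-submit command.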
The YARN cluster is started with the command .\HADOOP_HOME\sbin\start-yarn.cmd
and the HDFS nodes with .\HADOOP_HOME\sbin\start-dfs.cmd.
Note: I am on Windows 10!
For testing purposes, everything works fine if I run the application locally, and I can see the result of the code at http://localhost:9870/explorer.html#/.
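For reference, the working local run is essentially the same command with the master switched (or the commented-out .setMaster("local") in the code above), something like:

spark-submit --master local[*] --class com.batchjob.BatchJob D:\batchjob-0.0.1-SNAPSHOT-shaded.jar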
But when I try to let YARN decide how to manage the Spark Java application, i.e. use yarn as the master instead of local, I run into the following error:
2018-08-31 16:32:00 INFO Client:54 - Deleted staging directory file:/C:/Users/razvan.parautiu/.sparkStaging/application_1535721878844_0003
2018-08-31 16:32:00 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
at com.batchjob.BatchJob.main(BatchJob.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have checked other posts with the same error, but unfortunately none of them got it working...