I am trying to run a Java application on a Hadoop/Cloudera cluster.
I submit the application jar to the Hadoop/Cloudera cluster with the following command:
spark-submit --master yarn TmcStreamingProcessor-0.0.1-SNAPSHOT.jar --deploy-mode cluster --driver-memory 2g --executor-memory 2g --jars TmcStreamingProcessor-0.0.1-SNAPSHOT.jar
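A side note on the command above, in case it is relevant: spark-submit stops parsing its own options at the first non-option argument (the application jar) and passes everything after it as arguments to the application's main class. If that applies here, --deploy-mode, the memory settings, and --jars never reach spark-submit itself. A reordered invocation with the same values would look like this (the --jars entry is likely redundant, since the application jar itself is always shipped):

```shell
# All spark-submit options must come BEFORE the application jar;
# anything after the jar is treated as an argument to the main class.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 2g \
  TmcStreamingProcessor-0.0.1-SNAPSHOT.jar
```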
It should act as a consumer reading from a Kafka stream, while another Java application acts as the producer. After reading the streamed data, the application should write to MongoDB via Spark Structured Streaming, as follows:
// write to MongoDB
StreamingQuery streamingQuery = rowDatasetCounted
.repartition(1)
.selectExpr("... several columns ...")
.writeStream()
// .format("console")
// .outputMode("complete")
.option("checkpointLocation", checkpointFolder)
.trigger(Trigger.ProcessingTime(Duration.apply(20, TimeUnit.SECONDS)))
.foreach(forEachWriterMongo)
.start();
This snippet is taken from the TmcStreamingMongo class.
Unfortunately, after a few seconds of execution, I get a warning with the following exception:
WARN scheduler.TaskSetManager: Lost task 0.3 in stage 2.0
(TID 54, hadoop03.vmware.local, executor 4): java.lang.ClassNotFoundException:
TmcStreamingProcessor.utility.ForEachWriterMongo
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
After this warning, the job execution stops with the same error as the warning reported above. I have tried many solutions, but I cannot find the cause of this strange behavior, because the supposedly missing class clearly exists. Even stranger, the error still appears if I remove the entire body of the ForEachWriterMongo class.
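The stack trace shows the failure happening inside JavaDeserializationStream while an executor rebuilds the task: Java deserialization resolves ForEachWriterMongo by name against the executor's classpath, so the class must be present in the jars actually shipped to the executors, not just on the driver. A minimal sketch of that round trip with the plain JDK (no Spark; `Payload` is a hypothetical stand-in for the writer, not part of the project):

```java
import java.io.*;

public class SerializationDemo {
    // A Serializable payload standing in for the ForeachWriter instance.
    static class Payload implements Serializable {
        private static final long serialVersionUID = 1L;
        final String value;
        Payload(String value) { this.value = value; }
    }

    // Serialize then deserialize, as Spark's JavaSerializer does when
    // shipping a task closure from the driver to an executor.
    static String roundTrip(String v) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(new Payload(v));
        }
        // readObject resolves the class by name; if Payload were missing
        // from this JVM's classpath, it would throw ClassNotFoundException,
        // which is exactly the executor-side failure in the stack trace.
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return ((Payload) in.readObject()).value;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello")); // prints "hello"
    }
}
```

If the class were absent from the deserializing JVM, `readObject` would fail regardless of what the class body contains, which is consistent with the observation that emptying ForEachWriterMongo's body does not change the error.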
TmcStreamingMongo:
import TmcStreamingProcessor.model.TmcPathRawTimestamp;
import TmcStreamingProcessor.utility.ForEachWriterMongo;
import TmcStreamingProcessor.utility.KafkaConfig;
import TmcStreamingProcessor.utility.SparkSessionBean;
import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import scala.concurrent.duration.Duration;
import java.io.Serializable;
import java.util.concurrent.TimeUnit;
import static org.apache.spark.sql.functions.*;
@Service
@Slf4j
public class TmcStreamingMongo implements Serializable {
@Autowired
@Qualifier("singletonSparkSession")
private SparkSessionBean sparkSessionBean;
@Autowired
private KafkaConfig kafkaConfig;
@Value("${kafka.topic.grouped}")
private String topic;
@Autowired
ForEachWriterMongo forEachWriterMongo;
@Value("${checkpointMongo.base.path}")
private String checkpointFolder;
@SneakyThrows
@Async("threadPoolTaskExecutor")
public void start() {
// ObjectMapper objectMapper = new ObjectMapper();
SparkSession spark = sparkSessionBean.getSpark();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", (String) kafkaConfig.getKafkaConfig().get("bootstrap.servers"))
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss", "false")
.load();
// Dataset<TmcPathRawTimestamp> tmcPathRawDataset = df.map((MapFunction<Row, TmcPathRawTimestamp>) msg -> {
// TmcPathRaw tmcPathRawMapped = objectMapper.readValue((byte[]) msg.get(1), TmcPathRaw.class);
//
// return new TmcPathRawTimestamp(
// tmcPathRawMapped, new Timestamp(tmcPathRawMapped.getAvgGpsTime().getTime()));
// }, Encoders.bean(TmcPathRawTimestamp.class));
//
StructType tmcStruct = new StructType()
"...
several columns
..."
df.printSchema();
Dataset<Row> tmcPathRawDatasetTemp = df
.select(from_json(col("value").cast(DataTypes.StringType), tmcStruct).as("json"));
tmcPathRawDatasetTemp.printSchema();
Dataset<TmcPathRawTimestamp> tmcPathRawDataset = tmcPathRawDatasetTemp
.selectExpr("...
several columns
...")
.as(Encoders.bean(TmcPathRawTimestamp.class));
RelationalGroupedDataset groupedTmcPathRawDataset = tmcPathRawDataset
.withWatermark("timestamp", "10 seconds")
.groupBy(
functions.window(functions.col("timestamp"), "1 minute"),
functions.col("...several columns...")
);
Dataset<Row> rowDatasetCounted = groupedTmcPathRawDataset.agg("...aggregation processing...");
// write to MongoDB
StreamingQuery streamingQuery = rowDatasetCounted
.repartition(1)
.selectExpr("... several columns ...")
.writeStream()
// .format("console")
// .outputMode("complete")
.option("checkpointLocation", checkpointFolder)
.trigger(Trigger.ProcessingTime(Duration.apply(20, TimeUnit.SECONDS)))
.foreach(forEachWriterMongo)
.start();
streamingQuery.awaitTermination();
}
The ForEachWriterMongo class:
import com.mongodb.*;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.netcomgroup.eu.TmcStreamingProcessor.model.TmcPathGrouped;
import lombok.extern.slf4j.Slf4j;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
@Slf4j
@Component
public class ForEachWriterMongo extends ForeachWriter<Row> implements Serializable {
@Value("${spring.data.mongodb.database}")
private String dbName;
@Value("${spring.data.mongodb.collection}")
private String dbCollection;
@Value("${spring.data.mongodb.host}")
private String dbHost;
@Value("${spring.data.mongodb.port}")
private int dbPort;
private MongoClient mongoClient;
private List<TmcPathGrouped> tmcPathGroupedList;
@Override
public boolean open(long l, long l1) {
ForEachWriterMongo.log.info("Open inserting data on MongoDB...");
mongoClient = this.getMongoClient();
tmcPathGroupedList = new ArrayList<>();
return true;
}
@Override
public void process(Row r) {
TmcPathGrouped tmcPathGrouped =
new TmcPathGrouped(
r.getInt(0), r.getInt(1), r.getInt(2), r.getInt(3), r.getLong(4),
r.getDouble(5), r.getTimestamp(6), r.getTimestamp(7),
r.getTimestamp(8), r.getTimestamp(9)
);
tmcPathGroupedList.add(tmcPathGrouped);
}
@Override
public void close(Throwable throwable) {
if(!tmcPathGroupedList.isEmpty()){
this.write();
}
tmcPathGroupedList.clear();
mongoClient.close();
}
@Transactional("mongoDbTransactionManager")
public void write() {
ForEachWriterMongo.log.info("Start insert on Mongo...");
mongoClient.getDatabase(this.dbName).getCollection(this.dbCollection).insertMany(tmcPathGroupedList);
ForEachWriterMongo.log.info("Finished insert on Mongo...");
}
public MongoClient getMongoClient() {
final ConnectionString connectionString = new ConnectionString("mongodb://" + this.dbHost + ":" +
this.dbPort + "/" + dbName);
final MongoClientSettings mongoClientSettings = MongoClientSettings.builder()
.writeConcern(WriteConcern.MAJORITY)
.readConcern(ReadConcern.MAJORITY)
.readPreference(ReadPreference.primary())
.applyConnectionString(connectionString)
.build();
return MongoClients.create(mongoClientSettings);
}
}
Background information:
Component details with versions:
{
"name": "hadoop",
"pkg_release": "1605554",
"pkg_version": "3.0.0+cdh6.3.2",
"version": "3.0.0-cdh6.3.2"
},
{
"name": "parquet",
"pkg_release": "1605554",
"pkg_version": "1.9.0+cdh6.3.2",
"version": "1.9.0-cdh6.3.2"
},
{
"name": "kafka",
"pkg_release": "1605554",
"pkg_version": "2.2.1+cdh6.3.2",
"version": "2.2.1-cdh6.3.2"
},
{
"name": "solr",
"pkg_release": "1605554",
"pkg_version": "7.4.0+cdh6.3.2",
"version": "7.4.0-cdh6.3.2"
},
{
"name": "pig",
"pkg_release": "1605554",
"pkg_version": "0.17.0+cdh6.3.2",
"version": "0.17.0-cdh6.3.2"
},
{
"name": "kite",
"pkg_release": "1605554",
"pkg_version": "1.0.0+cdh6.3.2",
"version": "1.0.0-cdh6.3.2"
},
{
"name": "sqoop",
"pkg_release": "1605554",
"pkg_version": "1.4.7+cdh6.3.2",
"version": "1.4.7-cdh6.3.2"
},
{
"name": "hive",
"pkg_release": "1605554",
"pkg_version": "2.1.1+cdh6.3.2",
"version": "2.1.1-cdh6.3.2"
},
{
"name": "sentry",
"pkg_release": "1605554",
"pkg_version": "2.1.0+cdh6.3.2",
"version": "2.1.0-cdh6.3.2"
},
{
"name": "hbase-solr",
"pkg_release": "1605554",
"pkg_version": "1.5+cdh6.3.2",
"version": "1.5-cdh6.3.2"
},
{
"name": "flume-ng",
"pkg_release": "1605554",
"pkg_version": "1.9.0+cdh6.3.2",
"version": "1.9.0-cdh6.3.2"
},
{
"name": "hive-hcatalog",
"pkg_release": "1605554",
"pkg_version": "2.1.1+cdh6.3.2",
"version": "2.1.1-cdh6.3.2"
},
{
"name": "hadoop-httpfs",
"pkg_release": "1605554",
"pkg_version": "3.0.0+cdh6.3.2",
"version": "3.0.0-cdh6.3.2"
},
{
"name": "kudu",
"pkg_release": "1605554",
"pkg_version": "1.10.0+cdh6.3.2",
"version": "1.10.0-cdh6.3.2"
},
{
"name": "oozie",
"pkg_release": "1605554",
"pkg_version": "5.1.0+cdh6.3.2",
"version": "5.1.0-cdh6.3.2"
},
{
"name": "hadoop-kms",
"pkg_release": "1605554",
"pkg_version": "3.0.0+cdh6.3.2",
"version": "3.0.0-cdh6.3.2"
},
{
"name": "zookeeper",
"pkg_release": "1605554",
"pkg_version": "3.4.5+cdh6.3.2",
"version": "3.4.5-cdh6.3.2"
},
{
"name": "hadoop-hdfs",
"pkg_release": "1605554",
"pkg_version": "3.0.0+cdh6.3.2",
"version": "3.0.0-cdh6.3.2"
},
{
"name": "hadoop-yarn",
"pkg_release": "1605554",
"pkg_version": "3.0.0+cdh6.3.2",
"version": "3.0.0-cdh6.3.2"
},
{
"name": "spark",
"pkg_release": "1605554",
"pkg_version": "2.4.0+cdh6.3.2",
"version": "2.4.0-cdh6.3.2"
},
{
"name": "hadoop-mapreduce",
"pkg_release": "1605554",
"pkg_version": "3.0.0+cdh6.3.2",
"version": "3.0.0-cdh6.3.2"
},
{
"name": "hbase",
"pkg_release": "1605554",
"pkg_version": "2.1.0+cdh6.3.2",
"version": "2.1.0-cdh6.3.2"
},
{
"name": "hue",
"pkg_release": "1605554",
"pkg_version": "4.2.0+cdh6.3.2",
"version": "4.2.0-cdh6.3.2"
},
{
"name": "impala",
"pkg_release": "1605554",
"pkg_version": "3.2.0+cdh6.3.2",
"version": "3.2.0-cdh6.3.2"
}
Previously attempted solutions: I have tried several configurations for building my SparkSession (most of them are still there as comments, to remind me that I have already tried them). Unfortunately, none of these configurations solved my problem.
// .master("yarn")
// .master("spark://192.168.102.24:7077")
.master("local[*]")
.config("spark.sql.shuffle.partitions", 50)
// .config("spark.jars", System.getProperty("user.dir") + "/target/TmcStreamingProcessor-0.0.1-SNAPSHOT.jar")
// .config("spark.sql.warehouse.dir", "hdfs://192.168.102.24:8020/user/spark-warehouse")
// .config("spark.executor.extraClassPath","hdfs://192.168.102.24:8020/user/yarn-jars/*")
// .config("spark.driver.extraClassPath", "hdfs://192.168.102.24:8020/user/yarn-jars/*")
.config("spark.driver.userClassPathFirst", false)
// .config("spark.driver.extraLibraryPath", "hdfs://192.168.102.24:8020/user/yarn-jars/*")
// .enableHiveSupport()
//.config("spark.sql.catalogImplementation", "in-memory")
.config("spark.dynamicAllocation.enabled",false)
.getOrCreate();
I have tried several solutions found on similar questions, for example:
- manually uploading the jar to every cluster node;
- uploading the jar to a shared HDFS folder;
- specifying the jar location with the --jars option on spark-submit (or on the code side, in the SparkSession configuration);
- uploading all project dependencies to a shared HDFS folder and specifying the HDFS driver classpath and executor classpath in the cluster configuration.
None of these solutions worked for me.
Thanks in advance for any suggestions!