我正在使用Java。
我收到了Kafka消息的文件路径。我需要将此文件加载到spark RDD中,对其进行处理,然后将其转储到HDFS中。
我能够从Kafka消息中检索文件路径。我希望在这个文件上创建一个数据集/ RDD。
我无法在Kafka消息数据集上运行map函数。由于工作人员无法使用sparkContext
,因此NPE出错。
我无法在Kafka消息数据集上运行foreach
。它错误地显示消息:
Queries with streaming sources must be executed with writeStream.start();"
我无法collect
从kafka消息数据集收到的数据,因为它错误地显示了消息
Queries with streaming sources must be executed with writeStream.start();;
我想这必须是一个非常通用的用例,并且必须在很多设置中运行。
如何从我在Kafka消息中收到的路径中将文件作为RDD加载?
SparkSession spark = SparkSession.builder()
.appName("MyKafkaStreamReader")
.master("local[4]")
.config("spark.executor.memory", "2g")
.getOrCreate();
// Create DataSet representing the stream of input lines from kafka
Dataset<String> kafkaValues = spark.readStream()
.format("kafka")
.option("spark.streaming.receiver.writeAheadLog.enable", true)
.option("kafka.bootstrap.servers", Configuration.KAFKA_BROKER)
.option("subscribe", Configuration.KAFKA_TOPIC)
.option("fetchOffset.retryIntervalMs", 100)
.option("checkpointLocation", "file:///tmp/checkpoint")
.load()
.selectExpr("CAST(value AS STRING)").as(Encoders.STRING());
Dataset<String> messages = kafkaValues.map(x -> {
ObjectMapper mapper = new ObjectMapper();
String m = mapper.readValue(x.getBytes(), String.class);
return m;
}, Encoders.STRING() );
// ====================
// TEST 1 : FAILS
// ====================
// CODE TRYING TO execute MAP on the received RDD
// This fails with a Null pointer exception because "spark" is not available on worker node
/*
Dataset<String> statusRDD = messages.map(message -> {
// BELOW STATEMENT FAILS
Dataset<Row> fileDataset = spark.read().option("header", "true").csv(message);
Dataset<Row> dedupedFileDataset = fileDataset.dropDuplicates();
dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
return getHdfsLocation();
}, Encoders.STRING());
StreamingQuery query2 = statusRDD.writeStream().outputMode("append").format("console").start();
*/
// ====================
// TEST 2 : FAILS
// ====================
// CODE BELOW FAILS WITH EXCEPTION
// "Queries with streaming sources must be executed with writeStream.start();;"
// Hence, processing the deduplication on the worker side using
/*
JavaRDD<String> messageRDD = messages.toJavaRDD();
messageRDD.foreach( message -> {
Dataset<Row> fileDataset = spark.read().option("header", "true").csv(message);
Dataset<Row> dedupedFileDataset = fileDataset.dropDuplicates();
dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
});
*/
// ====================
// TEST 3 : FAILS
// ====================
// CODE TRYING TO COLLECT ALSO FAILS WITH EXCEPTION
// "Queries with streaming sources must be executed with writeStream.start();;"
// List<String> mess = messages.collectAsList();
如何阅读创建文件路径并在文件上创建RDD?
答案 0 :(得分:1)
在结构化流中,我认为没有办法在一个流中统一数据以用作数据集操作的参数。
在Spark生态系统中,通过组合Spark Streaming和Spark SQL(数据集)可以实现这一点。我们可以使用Spark Streaming来使用Kafka主题,然后使用Spark SQL,我们可以加载相应的数据并应用所需的过程。
这样的工作看起来大致如下:(这是在Scala中,Java代码将遵循相同的结构。只是实际的代码更冗长)
// configure and create spark Session
val spark = SparkSession
.builder
.config(...)
.getOrCreate()
// create streaming context with a 30-second interval - adjust as required
val streamingContext = new StreamingContext(spark.sparkContext, Seconds(30))
// this uses Kafka080 client. Kafka010 has some subscription differences
val kafkaParams = Map[String, String](
"metadata.broker.list" -> kafkaBootstrapServer,
"group.id" -> "job-group-id",
"auto.offset.reset" -> "largest",
"enable.auto.commit" -> (false: java.lang.Boolean).toString
)
// create a kafka direct stream
val topics = Set("topic")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
streamingContext, kafkaParams, topics)
// extract the values from the kafka message
val dataStream = stream.map{case (id, data) => data}
// process the data
dataStream.foreachRDD { dataRDD =>
// get all data received in the current interval
// We are assuming that this data fits in memory.
// We're not processing a million files per second, are we?
val files = dataRDD.collect()
files.foreach{ file =>
// this is the process proposed in the question --
// notice how we have access to the spark session in the context of the foreachRDD
val fileDataset = spark.read().option("header", "true").csv(file)
val dedupedFileDataset = fileDataset.dropDuplicates()
// this can probably be written in terms of the dataset api
//dedupedFileDataset.rdd().saveAsTextFile(getHdfsLocation());
dedupedFileDataset.write.format("text").mode("overwrite").save(getHdfsLocation())
}
}
// start the streaming process
streamingContext.start()
streamingContext.awaitTermination()