There are 40 topics in Kafka, and a Spark Streaming job is written for every 5 of them (one table per topic). The only purpose of each Spark Streaming job is to read its 5 Kafka topics and write them to the corresponding 5 HDFS paths. Most of the time this works fine, but sometimes it writes topic 1's data to one of the other HDFS paths.
Below is the code for one such Spark Streaming job, which is meant to process 5 topics and write each one to its corresponding HDFS path; occasionally, however, topic 1's data ends up under HDFS path 5 instead of HDFS path 1.
Please share your suggestions:
import java.text.SimpleDateFormat

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{ SparkConf, TaskContext }
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.kafka010._
object SparkKafkaMultiConsumer extends App {

  override def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println(s"""
        |Usage: KafkaStreams auto.offset.reset latest/earliest table1,table2,etc
        |
        """.stripMargin)
      System.exit(1)
    }
    val date_today = new SimpleDateFormat("yyyy_MM_dd")
    val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH")
    val PATH_SEPERATOR = "/"

    import com.typesafe.config.ConfigFactory
    val conf = ConfigFactory.load("env.conf")
    val topicconf = ConfigFactory.load("topics.conf")

    // Create context with custom second batch interval
    val sparkConf = new SparkConf().setAppName("pt_streams")
    val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))

    val kafka_topics = "kafka.topics"

    // Create direct kafka stream with brokers and topics
    var topicsSet = topicconf.getString(kafka_topics).split(",").toSet
    if (args.length == 2) {
      print("This stream job will process table(s) : " + args(1))
      topicsSet = args(1).split(",").toSet
    }
    val topicList = topicsSet.toList
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> conf.getString("kafka.brokers"),
      "zookeeper.connect" -> conf.getString("kafka.zookeeper"),
      "group.id" -> conf.getString("kafka.consumergroups"),
      "auto.offset.reset" -> args(0),
      "enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "security.protocol" -> "SASL_PLAINTEXT")

    val messages = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
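
    // For each topic, narrow the shared direct stream to that topic's records and
    // write them out under <hdfs.streamoutpath>/<topic>/<date>/<hour>/<current millis>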
    for (i <- 0 until topicList.length) {
      /**
       * set a timer to see how much time the filter operation takes for each topic
       */
      val topicStream = messages.filter(_.topic().equals(topicList(i)))
      val data = topicStream.map(_.value())
      data.foreachRDD((rdd, batchTime) => {
        // val data = rdd.map(_.value())
        if (!rdd.isEmpty()) {
          rdd.coalesce(1).saveAsTextFile(conf.getString("hdfs.streamoutpath") + PATH_SEPERATOR + topicList(i)
            + PATH_SEPERATOR + date_today.format(System.currentTimeMillis())
            + PATH_SEPERATOR + date_today_hour.format(System.currentTimeMillis())
            + PATH_SEPERATOR + System.currentTimeMillis())
        }
      })
    }
    try {
      // After all successful processing, commit the offsets back to Kafka
      messages.foreachRDD { rdd =>
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    } catch {
      case e: Exception =>
        e.printStackTrace()
        print("error while committing the offsets")
    }
    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
Answer (score: 1):
You're better off using the HDFS connector for Kafka Connect. It is open source and available standalone or as part of the Confluent Platform. It only needs a simple configuration file to stream from Kafka topics to HDFS, and if you have a schema for your data it will create the Hive tables for you.
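As a rough sketch only (the connector class name is the one used by the open-source Confluent HDFS connector; the topic names, HDFS URL, and sizes below are placeholders, not taken from the question), a standalone sink configuration might look something like this:

name=hdfs-sink-example
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=5
topics=topic1,topic2,topic3,topic4,topic5
hdfs.url=hdfs://namenode:8020
flush.size=1000
# optional, only if you have a schema and want Hive tables created automatically
# hive.integration=true
# hive.metastore.uris=thrift://hive-metastore:9083

A file like this is typically passed to the Connect standalone launcher together with a worker properties file (for example connect-standalone worker.properties hdfs-sink.properties; the exact script name depends on your distribution).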
If you write this yourself, you are reinventing the wheel; it's a solved problem :)