I have a question about alpakka_kafka + alpakka_s3 integration. When I use an Alpakka Kafka source, the Alpakka S3 multipartUpload does not appear to upload any files.
kafkaSource ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore
However, when I add .take(100) right after kafkaSource, everything works fine.
kafkaSource.take(100) ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore
Any help would be appreciated. Thanks in advance!
Here is the complete code snippet:
// Source
val kafkaSource: Source[(CommittableOffset, Array[Byte]), Consumer.Control] = {
  Consumer
    .committableSource(consumerSettings, Subscriptions.topics(prefixedTopics))
    .map(committableMessage => (committableMessage.committableOffset, committableMessage.record.value))
    .watchTermination() { (mat, f: Future[Done]) =>
      f.foreach { _ =>
        log.debug("consumer source shutdown, consumerId={}, group={}, topics={}", consumerId, group, prefixedTopics.mkString(", "))
      }
      mat
    }
}
// Flow
val commitFlow: Flow[CommittableOffset, Done, NotUsed] = {
  Flow[CommittableOffset]
    .groupedWithin(batchingSize, batchingInterval)
    .map(group => group.foldLeft(CommittableOffsetBatch.empty) { (batch, elem) => batch.updated(elem) })
    .mapAsync(parallelism = 3) { msg =>
      log.debug("committing offset, msg={}", msg)
      msg.commitScaladsl().map { result =>
        log.debug("committed offset, msg={}", msg)
        result
      }
    }
}
private val kafkaMsgToByteStringFlow = Flow[KafkaMessage[Any]].map(x => ByteString(x.msg + "\n"))
private val kafkaMsgToOffsetFlow = {
  implicit val askTimeout: Timeout = Timeout(5.seconds)
  Flow[KafkaMessage[Any]].mapAsync(parallelism = 5) { elem =>
    Future(elem.offset)
  }
}
// Sink
val s3Sink = {
  val BUCKET = "test-data"
  s3Client.multipartUpload(BUCKET, s"tmp/data.txt")
}

// Doesn't work... (no files show up on S3)
kafkaSource ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore

// This one works...
kafkaSource.take(100) ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore
Answer 0 (score: 1)
It actually does upload. The catch is that a multipart upload is only finished after the sink sends a completion request to S3, and only then does the file become visible in the bucket. My bet is that because a Kafka source without take(n) never stops emitting data downstream, the stream never completes, so the sink never sends that completion request: it keeps expecting more data before finalizing the upload.
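For illustration only, here is a sketch reusing the names from the question (the logging line and the bucket/key literals are assumptions): the multipartUpload sink materializes a Future[MultipartUploadResult] that completes only when the upstream completes, which is why the take(100) variant actually finalizes the upload.

// Sketch: materialize the upload result to observe when S3 finalizes the file.
val upload: Future[MultipartUploadResult] =
  kafkaSource
    .take(100)                                        // finite stream => upstream completes
    .via(kafkaSubscriber.serializer.deserializeFlow)
    .via(kafkaMsgToByteStringFlow)
    .runWith(s3Client.multipartUpload("test-data", "tmp/data.txt"))

upload.foreach(_ => log.debug("multipart upload completed"))  // never fires without take(n)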
There is no way to stream everything into a single file this way, so my workaround is to group the kafkaSource messages and send the compressed Array[Byte] batches to the sink. The tricky part is that you have to create a sink per file instead of reusing a single sink, as sketched below.
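A minimal sketch of that workaround, reusing the question's names; BATCH_SIZE, BATCH_DURATION, the per-batch file name and java.util.UUID are assumptions, and the offset-commit branch is left out for brevity.

kafkaSource
  .via(kafkaSubscriber.serializer.deserializeFlow)
  .via(kafkaMsgToByteStringFlow)
  .groupedWithin(BATCH_SIZE, BATCH_DURATION)          // assumed batching settings
  .map(batch => batch.fold(ByteString.empty)(_ ++ _)) // concatenate one batch into one payload
  .mapAsync(parallelism = 1) { bytes =>
    // a fresh multipartUpload sink per batch, so every upload gets its completion request
    Source.single(bytes)
      .runWith(s3Client.multipartUpload("test-data", s"tmp/${UUID.randomUUID()}.txt"))
  }
  .runWith(Sink.ignore)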
Answer 1 (score: 0)
private def running: Receive = {
  case Subscribe(subscriberId) =>
    val kafkaSubscriber = new KafkaSubscriber(
      serviceName = "akka_kafka_subscriber",
      group = kafkaConfig.group,
      topics = kafkaConfig.subscriberTopics,
      system = system,
      configurationProperties = Seq(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest")
    )

    RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._
      val bcast = builder.add(Broadcast[KafkaMessage[Any]](2))

      kafkaSource ~> kafkaSubscriber.serializer.deserializeFlow ~> kafkaSubscriber.filterTypeFlow[Any] ~> bcast.in

      // Batch the messages and hand each batch to the actor, which uploads it as its own file.
      bcast.out(0) ~> kafkaMsgToStringFlow
        .groupedWithin(BATCH_SIZE, BATCH_DURATION)
        .map(group => group.foldLeft(new StringBuilder()) { (batch, elem) => batch.append(elem) })
        .mapAsync(parallelism = 3) { data =>
          self ? ReadyForUpload(ByteString(data.toString()), UUID.randomUUID().toString, subscriberId)
        } ~> Sink.ignore

      bcast.out(1) ~> kafkaMsgToOffsetFlow ~> kafkaSubscriber.commitFlow ~> Sink.ignore

      ClosedShape
    }).withAttributes(ActorAttributes.supervisionStrategy(decider)).run()

    sender ! "subscription started"

  case ready: ReadyForUpload =>
    println("==========================Got ReadyForUpload: " + ready.fileName)
    val BUCKET = "S3_BUCKET"
    // A fresh multipartUpload sink per file, so each upload is completed and the object appears in S3.
    Source.single(ready.data).runWith(s3Client.multipartUpload(BUCKET, s"tmp/${ready.fileName}_${ready.subscriberId}.txt"))
    sender() ! "Done"
}