I am creating a Kafka stream and sending a single message, on which certain transformations take place before the result is produced to an output Kafka topic. When I send just a single message, I do not see any output, but when I send more than one message (2, 3, and so on) I start to see some output. Can someone explain what I am missing to get output for a single message as well?
My scenario is something like this:
I have a test case that writes to a Kafka input topic. My service reads from the input topic and writes its results to the output topic. The test case then reads from the output topic and displays the final result.
//Producer
producer.send(record, new Callback {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    if (e != null) {
      // Send failed: record the failure for the test and log it.
      System.out.print("in system")
      print("in print")
      ex += "No confirmation received"
      logger.warn(s"No confirmation received for $message", e)
    } else if (recordMetadata != null) {
      // Send succeeded: capture the broker-assigned metadata.
      ex += "message sent" + s"message $message sent: checksum=${recordMetadata.checksum}, " +
        s"offset=${recordMetadata.offset}, partition=${recordMetadata.partition}"
      logger.info(s"message $message sent: checksum=${recordMetadata.checksum}, " +
        s"offset=${recordMetadata.offset}, partition=${recordMetadata.partition}")
    }
  }
}).get() // .get() blocks until the send completes, making it synchronous
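For completeness, the producer in the snippet above is constructed roughly as follows; the broker address, topic name, payload, and serializer choices here are placeholders, not my exact configuration:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("acks", "all") // wait for the broker to fully acknowledge each send

val producer = new KafkaProducer[String, String](props)
val message = """{"id":1,"text":"spark fvt"}""" // placeholder payload
val record = new ProducerRecord[String, String]("input-topic", message)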
//Consumer
runner = new Thread() {
  override def run(): Unit = {
    while (inProgress) {
      // Poll the output topic for up to one second per iteration.
      val newMessages = consumer.poll(1000)
      val it = newMessages.iterator()
      System.out.print("hasNext is " + it.hasNext.toString)
      while (it.hasNext) {
        val record = it.next()
        System.out.print("one more " + record.offset() + " " + record.toString)
        logger.info(s"Received record: $record")
        store.append(record)
      }
    }
  }
}
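The test-case consumer is set up roughly like this; the broker address, group id, and topic name are placeholders:

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("group.id", "testcase-consumer") // placeholder group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest") // see the PS at the end

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("output-topic").asJava)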
The message reaches the input topic correctly, and I can verify that with log messages. However, on the service side I am trying to print the stream using stream.print():
Try(
  KafkaStreaming.createKafkaStream[String, String, StringDeserializer, StringDeserializer]
    (ssc, config, topics)
) match {
  case Failure(_: SparkException) if nTries > 1 =>
    // Stream creation failed: back off briefly and retry.
    Thread.sleep(500)
    createDStream(ssc, config, topics, nTries - 1)
  case Failure(exn: Throwable) => throw exn
  case Success(stream) =>
    stream.foreachRDD { rdd =>
      if (rdd.isEmpty())
        print("The rdd received is empty")
      else
        // Note: this print runs on the executors, not the driver.
        rdd.foreach(p => print("The rdd received is non empty" + p._1 + p._2))
    }
    stream.print()
    stream
}
}
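For context, createKafkaStream is (in simplified form) a wrapper around the spark-streaming-kafka-0-10 direct stream; a minimal sketch of the equivalent call, with placeholder broker address, group id, and offset-reset value:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

def createKafkaStreamSketch(ssc: StreamingContext, topics: Set[String]) = {
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092", // placeholder broker
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "service-group", // placeholder group id
    "auto.offset.reset" -> "latest" // the setting the PS below is about
  )
  KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
}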
I only start seeing the result printed from the third record onwards, and when sending only one record I always see "The rdd received is empty".
Sample output:
The rdd received is empty-------------------------------------------
Time: 1513579502000 ms
-------------------------------------------
-------------------------------------------
Time: 1513579504000 ms
-------------------------------------------
(null,{'id':2,'text':'spark fvt'})
+---+---------+-----+
|id |text |label|
+---+---------+-----+
|2 |spark fvt|0.0 |
+---+---------+-----+
+---+---------+-----+------------+--------------------+--------------------+--------------------+----------+
| id| text|label| words| features| rawPrediction| probability|prediction|
+---+---------+-----+------------+--------------------+--------------------+--------------------+----------+
| 2|spark fvt| 0.0|[spark, fvt]|(1000,[105,983],[...|[0.16293291377568...|[0.54064335448518...| 0.0|
+---+---------+-----+------------+--------------------+--------------------+--------------------+----------+
The value of sendToSink [id: bigint, text: string ... 6 more fields]The rdd received is empty-------------------------------------------
Time: 1513579506000 ms
-------------------------------------------
The rdd received is empty-------------------------------------------
Time: 1513579508000 ms
-------------------------------------------
The rdd received is empty-------------------------------------------
Time: 1513579510000 ms
-------------------------------------------
The rdd received is empty-------------------------------------------
Time: 1513579512000 ms
-------------------------------------------
The rdd received is empty-------------------------------------------
Time: 1513579514000 ms
-------------------------------------------
-------------------------------------------
Time: 1513579516000 ms
-------------------------------------------
(null,{'id':3,'text':'spark fvt'})
+---+---------+-----+
|id |text |label|
+---+---------+-----+
|3 |spark fvt|0.0 |
+---+---------+-----+
+---+---------+-----+------------+--------------------+--------------------+--------------------+----------+
| id| text|label| words| features| rawPrediction| probability|prediction|
+---+---------+-----+------------+--------------------+--------------------+--------------------+----------+
| 3|spark fvt| 0.0|[spark, fvt]|(1000,[105,983],[...|[0.16293291377568...|[0.54064335448518...| 0.0|
+---+---------+-----+------------+--------------------+--------------------+--------------------+----------+
PS: I am able to retrieve the first record if auto.offset.reset is set to "earliest". But if it is set to "latest", the first record goes missing. Any thoughts?
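For reference, this is the setting the PS is about, as it would appear in the stream sketch above; the behavior description follows standard Kafka consumer semantics:

// With no committed offset for the group id, "latest" positions the consumer at
// the current end of the topic, so a record produced before the first poll()'s
// partition assignment completes is never seen; "earliest" replays the topic
// from the beginning instead.
val params = kafkaParams + ("auto.offset.reset" -> "earliest") // vs. "latest"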