I have 3 machines running Storm supervisors (v0.9.4, each running 4 workers), 3 machines running Kafka brokers (a single topic with 6 partitions and a replication factor of 2), and 2 machines running Elasticsearch. Each machine has 4 GB of RAM and a dual-core 3.00 GHz CPU; all are connected over a 1 Gbps LAN.
I feed an Apache access log file containing 2 million events (~250 MB) to the Kafka brokers using the console producer. Loading it into Kafka takes around 20 seconds.
Meanwhile my Trident topology is running, and it takes around 7 minutes until all 2 million events are indexed into Elasticsearch. The topology consists of a Trident function (ExtractData) that uses regular expressions to extract fields and construct a JSON document.
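To make the extraction step concrete, here is a minimal sketch of the kind of work ExtractData does per event. It is not the actual function: the class name `AccessLogExtractor`, the field names, and the Common Log Format pattern are assumptions for illustration, and it is plain Java with no Storm dependency so it can be run and timed on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the per-event work inside ExtractData.
class AccessLogExtractor {
    // Assumed Common Log Format prefix: host ident user [time] "request" status bytes.
    // Compiled once and reused, so the compile cost is not paid per event.
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    // Illustrative field names, in capture-group order.
    private static final String[] NAMES =
        {"host", "ident", "user", "time", "request", "status", "bytes"};

    // Returns the extracted fields, or null when the line does not match.
    static Map<String, String> extract(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.find()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < NAMES.length; i++) {
            fields.put(NAMES[i], m.group(i + 1));
        }
        return fields;
    }

    // Escapes backslashes and double quotes only; control characters are not handled.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a flat JSON object in which every value is emitted as a string.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(toJson(extract(line)));
    }
}
```

Timing a loop over this on one core would give a rough upper bound on what a single worker could extract per second, independent of Kafka and Elasticsearch.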
Question:
(I am not convinced by the 7 minutes this cluster currently takes to index. That works out to roughly 5K events per second.)
Given my current results, is it possible to approximate the throughput for a given cluster size and hardware spec (for example, with a roughly linear relation)?
Here are a few configuration settings I obtained.
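For reference, the arithmetic behind those figures, plus the kind of naive linear extrapolation I have in mind, can be sketched as follows. The class and method names are hypothetical, and the linear model is purely an assumption (throughput proportional to worker count); whether it actually holds is exactly what I am asking.

```java
// Hypothetical helper: back-of-the-envelope throughput numbers from the measurements above.
class ThroughputEstimate {
    // Average rate over a measured run.
    static double eventsPerSecond(long events, double seconds) {
        return events / seconds;
    }

    // Naive model: assumes throughput scales in direct proportion to worker count.
    // This ignores Kafka partition limits, Elasticsearch indexing capacity and
    // batching settings, any of which could be the real bottleneck.
    static double naiveLinearEstimate(double measuredRate, int measuredWorkers, int targetWorkers) {
        return measuredRate * targetWorkers / measuredWorkers;
    }

    public static void main(String[] args) {
        long events = 2_000_000L;

        double kafkaRate = eventsPerSecond(events, 20);      // console producer into Kafka
        double indexRate = eventsPerSecond(events, 7 * 60);  // end to end into Elasticsearch
        double perWorker = indexRate / 12;                   // topology.workers = 12

        System.out.printf("Kafka ingest:     %.0f events/s%n", kafkaRate);
        System.out.printf("Indexing:         %.0f events/s%n", indexRate);
        System.out.printf("Per worker:       %.0f events/s%n", perWorker);
        System.out.printf("Naive 24 workers: %.0f events/s%n",
            naiveLinearEstimate(indexRate, 12, 24));
    }
}
```

The measured indexing rate is about 4,762 events per second, compared with about 100,000 events per second going into Kafka, so the brokers do not look like the limiting stage in this run.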
topology.workers: 12
topology.debug: false
topology.max.spout.pending: 1
topology.message.timeout.secs: 60
topology.trident.batch.emit.interval.millis: 500