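I am trying to write Spark Java code that publishes messages to a Kafka topic. It works as expected, but the time it takes to load the data is about the same as with the naive approach.
Can you help me understand whether I am missing something?
Spark approach: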
    public void loadInputDataToKafka(JavaRDD<Customer> inputData)
    {
        inputData.foreachPartition(customerIterator -> {
            Properties properties = getKafkaConfiguration();
            Producer<String, String> producer = new KafkaProducer<String, String>(properties);
            while (customerIterator.hasNext()) {
                producer.send(new ProducerRecord<String, String>("testTopic",
                        null, customerIterator.next().toString())).get();
            }
            producer.close();
        });
    }
Naive approach:
    public void loadInputDataToKafka(JavaRDD<Customer> inputData)
    {
        Properties properties = getKafkaConfiguration();
        Producer<String, String> producer = new KafkaProducer<String, String>(properties);
        // Pull every record back to the driver and send from a single JVM.
        List<Customer> customerList = inputData.collect();
        for (Customer customer : customerList) {
            try {
                producer.send(new ProducerRecord<String, String>("testTopic",
                        null, customer.toString())).get();
            } catch (Exception e) {
                producer.close();
                throw Throwables.propagate(e);
            }
        }
        // Close the producer on the success path as well.
        producer.close();
    }
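Loading 80K records takes about the same time (150 seconds) with approach 1 and approach 2. Am I not using Spark partitions correctly? If not, what advantage do I get from Spark over regular iteration?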
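One thing I suspect but have not confirmed: both versions call .get() on the Future returned by send(...), so each record waits for its acknowledgement before the next one goes out. The sketch below is not Kafka code. It uses only the Java standard library to simulate that pattern, and fakeSend, blockingMillis, pipelinedMillis, the pool size, and the 10 ms latency are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SendLatencySketch {

    // Hypothetical stand-in for producer.send(...): completes after a simulated round trip.
    static CompletableFuture<String> fakeSend(ExecutorService pool, String value, long latencyMs) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(latencyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return value;
        }, pool);
    }

    // Mirrors send(...).get(): wait for each acknowledgement before sending the next record.
    static long blockingMillis(List<String> records, ExecutorService pool, long latencyMs) {
        long start = System.nanoTime();
        for (String record : records) {
            fakeSend(pool, record, latencyMs).join();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // Issues every send first, then waits once for all acknowledgements.
    static long pipelinedMillis(List<String> records, ExecutorService pool, long latencyMs) {
        long start = System.nanoTime();
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (String record : records) {
            pending.add(fakeSend(pool, record, latencyMs));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(20);
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            records.add("customer-" + i);
        }
        long blocking = blockingMillis(records, pool, 10);
        long pipelined = pipelinedMillis(records, pool, 10);
        System.out.println("blocking=" + blocking + "ms, pipelined=" + pipelined + "ms");
        pool.shutdown();
    }
}
```

If the same effect applies to my job, the equivalent change with the real producer would presumably be to drop the per-record .get(), call send(...) for every record in the partition, and then call producer.flush() before producer.close(). Whether that accounts for the whole 150 seconds would also depend on how many partitions inputData actually has.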