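Here is my workflow.
Step 1: Download a large file (more than 5 GB)
Step 2: Split the file into small files
Step 3: Process the files (the split files are processed using a partitioner and a taskExecutor)
Repeat steps 1 to 3 until some condition is met, typically more than 10k repetitions
The batch performs very well at first, but performance degrades over time. Note: data processing itself is not the bottleneck.
I suspect that the repeated partition step creates new threads on every repetition.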
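One way to sanity-check the thread suspicion outside the batch job is to count how many threads a single shared pool actually creates across many repetitions. The following is only a minimal sketch using a plain JDK pool, not Spring's ThreadPoolTaskExecutor (which wraps a java.util.concurrent.ThreadPoolExecutor); the class name ThreadReuseProbe, the counting ThreadFactory, and the sizes (4 threads, 16 tasks, 100 repetitions) are assumptions made for illustration and say nothing definitive about the job above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical probe: not part of the original configuration.
public class ThreadReuseProbe {

    /**
     * Runs `repetitions` batches of `tasksPerBatch` no-op tasks on ONE shared
     * fixed-size pool and returns how many threads the pool created in total.
     */
    static int countThreadsCreated(int poolSize, int tasksPerBatch, int repetitions) {
        AtomicInteger created = new AtomicInteger();
        // Count every thread the pool asks the factory for.
        ThreadFactory countingFactory = runnable ->
                new Thread(runnable, "probe-" + created.incrementAndGet());
        ExecutorService pool = Executors.newFixedThreadPool(poolSize, countingFactory);
        try {
            for (int r = 0; r < repetitions; r++) {
                CountDownLatch done = new CountDownLatch(tasksPerBatch);
                for (int t = 0; t < tasksPerBatch; t++) {
                    pool.execute(done::countDown); // stands in for one partition
                }
                done.await(); // wait for the whole batch before repeating
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a batch", e);
        } finally {
            pool.shutdown();
        }
        return created.get();
    }

    public static void main(String[] args) {
        // 4 pool threads, 16 tasks per batch, 100 repetitions
        System.out.println("threads created: " + countThreadsCreated(4, 16, 100));
        // prints "threads created: 4"
    }
}
```

If a pool that is created once behaves like this, the thread count stays flat no matter how often the batch repeats. A count that grows with the number of repetitions would instead point to the executor (or the bean that owns it) being recreated on each repetition.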
Here is my configuration:
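@Bean
public Job myjob(JobBuilderFactory jobs) throws Exception {
    return jobs.get("myjob")
            .start(DownloadStep())
            // loop back to the download step while the partitioned step reports CONTINUE_CONDITION
            .next(master()).on(CONTINUE_CONDITION).to(DownloadStep())
            // leave the loop and clean up once it reports STOP_CONDITION
            .next(master()).on(STOP_CONDITION).to(cleanUpStep())
            .end() // closes the flow definition so that build() returns a Job
            .build();
}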
@Bean
@StepScope
public Partitioner partitioner() throws IOException {
    MultiResourcePartitioner multiResourcePartitioner = new MultiResourcePartitioner();
    ClassLoader cl = this.getClass().getClassLoader();
    ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver(cl);
    // someFilePath is a placeholder for the path that is elided in the original post
    Resource[] resources = resolver.getResources("file:" + someFilePath);
    multiResourcePartitioner.setResources(resources);
    // the return value is discarded here; the partition step calls partition(gridSize) itself
    multiResourcePartitioner.partition(10);
    return multiResourcePartitioner;
}
@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    // only maxPoolSize is set; corePoolSize and queueCapacity keep their defaults
    taskExecutor.setMaxPoolSize(30);
    return taskExecutor;
}
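@Bean
@Qualifier("master")
public Step master() throws Exception {
    return stepBuilderFactory.get("master")
            .partitioner(process())                // worker step executed once per partition
            .partitioner("process", partitioner()) // one partition per split file
            .taskExecutor(taskExecutor())          // partitions run on the shared pool
            .build();
}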
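Update: Following Michael Minella's suggestion, I upgraded Spring Batch to 4.0.0.RC1, but performance did not improve.
After more than 150 repetitions, the batch takes over 15 minutes just to create the partitioned step. I create 16 partitions for each file.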
2017-12-05 17:00:51,660 INFO [THREAD ID=main] XXXXXConfiguration. - Resource files: 16
2017-12-05 17:11:20,923 INFO [THREAD ID=taskExecutor-17] XXXProcessor. - XXXXListener beforeStep StepExecution: id=841, version=1, name=XXXStep:partition14, status=STARTED, exitStatus=EXECUTING, readCount=0, filterCount=0, writeCount=0 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=0, rollbackCount=0
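To put a number on the delay, the gap between the two log lines above can be computed from their timestamps. This is only a small sketch with java.time; the class name LogGap and the assumption that every line starts with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp are mine, taken from the two lines shown.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: measures the time between two log timestamps.
public class LogGap {

    // Timestamp layout assumed from the two log lines quoted above.
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the time elapsed between two log timestamps. */
    static Duration gap(String earlier, String later) {
        return Duration.between(
                LocalDateTime.parse(earlier, LOG_TIME),
                LocalDateTime.parse(later, LOG_TIME));
    }

    public static void main(String[] args) {
        Duration d = gap("2017-12-05 17:00:51,660", "2017-12-05 17:11:20,923");
        System.out.println(d.toMillis() + " ms"); // prints "629263 ms"
    }
}
```

For these two lines the gap is about 10 minutes 29 seconds between resolving the 16 resource files and partition14 reaching beforeStep. Applying the same calculation to consecutive repetitions would show how quickly that gap grows.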