Spring Batch - How do I use partitioning to read and write data in parallel?

Asked: 2017-04-19 08:16:53

Tags: multithreading spring-batch

I have a Spring Batch application (3.0.7), launched via Spring Boot, that reads a large number of XML files in parallel, processes them, and "spits out" INSERT or UPDATE statements against an Oracle DB.

To process the files in parallel I am using a Partitioner. This works fine, except for the JdbcWriter, which seems to be bound to a single thread. Since I am using a ThreadPoolTaskExecutor, I expected the step to run the reader, processor, and writer in parallel. However, the JdbcWriter always seems to be bound to Thread-1 (I can see this in the logs, and also by profiling the database connections: only one connection is ever active, even though my data source is configured to use a pool of 20 connections).

I have annotated the reader, processor, and writer with @StepScope. How can I effectively use all the threads configured in the taskExecutor to read AND write in parallel?

Here is an excerpt of my configuration:

@Bean
public Job parallelJob() throws Exception {
    return jobBuilderFactory.get("parallelJob")
            .start(splitFileStep())
            .next(recordPartitionStep())
            .build();
}

@Bean
public Step recordPartitionStep() {
    return stepBuilderFactory.get("factiva-recordPartitionStep")
            .partitioner(recordStep())
            .partitioner("recordStep", recordPartitioner(null)) <!-- this is used to inject some data from the job context
            .taskExecutor(taskExecutor())
            .build();
}

@Bean
public Step recordStep() {
    return stepBuilderFactory.get("recordStep")
            .<Object, StatementHolderMap>chunk(1000)
            .reader(recordReader(null)) // this is used to inject some data from the job context
            .processor(recordProcessor) // this is @Autowired, and the bean is marked @StepScope
            .writer(jdbcItemWriter())
            .build();
}

@Bean
@StepScope
public ItemStreamReader recordReader(@Value("#{stepExecutionContext['record-file']}") Resource resource) {
    // THIS IS A StaxEventItemReader
}

@Bean
@StepScope
public JdbcItemWriter jdbcItemWriter() {

    JdbcItemWriter jdbcItemWriter = new JdbcItemWriter();
    jdbcItemWriter.setDataSource(dataSource);
    ...
    return jdbcItemWriter;
}

@Value("${etl.factiva.partition.cores}")
private int threadPoolSize;

@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    if (threadPoolSize == 0) {
        threadPoolSize = Runtime.getRuntime().availableProcessors();
    }
    taskExecutor.setMaxPoolSize(threadPoolSize);
    taskExecutor.afterPropertiesSet();

    return taskExecutor;
}

1 Answer:

Answer 0 (score: 3)

I figured out why Spring Batch was not using all the configured threads.

First, the Spring configuration for the partitioner was wrong. The original configuration did not set a gridSize value and incorrectly referenced the step to be run in the partitions.

Second, the ThreadPoolTaskExecutor used in the original configuration did not seem to work. Switching to a SimpleAsyncTaskExecutor did the trick.

I am still not sure why the ThreadPoolTaskExecutor did not work. If anything, the javadoc for SimpleAsyncTaskExecutor actually recommends a thread-pooling executor when threads need to be reused.
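
A likely explanation, though it was never confirmed here: ThreadPoolTaskExecutor only creates threads beyond its corePoolSize (which defaults to 1) once its queue is full, and the default queue is unbounded, so setting only maxPoolSize leaves a single worker thread. A minimal sketch of a pool configuration that should scale, reusing the determineWorkerThreads() helper from the configuration below:

@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    // The pool only grows beyond corePoolSize (default 1) when its queue is
    // full, and the default queue is unbounded, so the core size must be
    // raised explicitly; setting maxPoolSize alone has no effect here.
    taskExecutor.setCorePoolSize(determineWorkerThreads());
    taskExecutor.setMaxPoolSize(determineWorkerThreads());
    taskExecutor.setThreadNamePrefix("fac-thrd-");
    taskExecutor.afterPropertiesSet();
    return taskExecutor;
}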

I am also not 100% sure I fully understand the implications of the gridSize value. For now, I set gridSize equal to the number of threads used in the partitioned step. It would be great if someone could comment on this approach, @Michael Minella? :)
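
For context, gridSize is the hint passed to Partitioner.partition(int gridSize), and the number of entries in the map that method returns is what actually determines how many worker step executions are created. A minimal sketch of a file-per-partition implementation, where the list of file paths is a hypothetical stand-in for whatever the real recordPartitioner uses:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class RecordFilePartitioner implements Partitioner {

    private final List<String> recordFilePaths; // hypothetical: one XML file per partition

    public RecordFilePartitioner(List<String> recordFilePaths) {
        this.recordFilePaths = recordFilePaths;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // gridSize is only a hint; the number of entries returned here is
        // what actually drives the number of worker step executions.
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (String path : recordFilePaths) {
            ExecutionContext context = new ExecutionContext();
            // Stored as a String; Spring converts it to a Resource when the
            // reader injects #{stepExecutionContext['record-file']}
            context.putString("record-file", path);
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}

With a shape like this it is also fine for partitions to outnumber threads: the extra worker step executions simply wait until a thread frees up.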

Here is the correct configuration, for reference.

@Bean
public Job parallelJob() throws Exception {
    return jobBuilderFactory.get("parallelJob")
            .start(splitFileStep())
            .next(recordPartitionStep())
            .build();
}

@Bean
public Step recordPartitionStep() {
    return stepBuilderFactory.get("factiva-recordPartitionStep")
            .partitioner(recordStep().getName(), recordPartitioner(null)) // the recordPartitioner argument is injected at runtime
            .step(recordStep())
            .gridSize(determineWorkerThreads()) // grid size must equal the number of threads configured for the thread pool
            .taskExecutor(taskExecutor())
            .build();
}

@Bean
public Step recordStep() {
    return stepBuilderFactory.get("recordStep")
            .<Object, StatementHolderMap>chunk(1000)
            .reader(recordReader(null)) // this is used to inject some data from the job context
            .processor(recordProcessor) // this is @Autowired, and the bean is marked @StepScope
            .writer(jdbcItemWriter())
            .build();
}

@Bean
@StepScope
public ItemStreamReader recordReader(@Value("#{stepExecutionContext['record-file']}") Resource resource) {
    // THIS IS A StaxEventItemReader
}

@Bean
@StepScope
public JdbcItemWriter jdbcItemWriter() {

    JdbcItemWriter jdbcItemWriter = new JdbcItemWriter();
    jdbcItemWriter.setDataSource(dataSource);
    ...
    return jdbcItemWriter;
}

@Value("${etl.factiva.partition.cores}")
private int threadPoolSize;

@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor("fac-thrd-");

    taskExecutor.setConcurrencyLimit(determineWorkerThreads());
    return taskExecutor;
}

// threadPoolSize is a configuration parameter for the job
private int determineWorkerThreads() {
    if (threadPoolSize == 0) {
        threadPoolSize = Runtime.getRuntime().availableProcessors();
    }
    return threadPoolSize;

}