Question

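I have a large file that may contain 1 to 5 billion records. I plan to use chunk-oriented processing, and my idea is:

1) Split the large file into smaller files, with 10K records in each file.

2) If there are 1 billion records, I will get 10000 files, each containing 10K records.

3) I want to partition these 10000 files and process them with 10 threads. I am using a custom MultiResourcePartitioner.

4) The 10 threads should process all 10000 files created by the split.

5) I do not want to create as many threads as there are files, because I could run into memory issues. What I am looking for is to process the files with 10 threads, however many files there are.

Experts, could you let me know whether this can be achieved with Spring Batch? If so, please share pointers or a reference implementation.

Example: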

<bean id="transformPartitioner"
    class="com.example.transformers.partition.TransformerPartitioner">
    <property name="outputPath" value="${output.directory}" />
</bean>

<bean id="loadTransformData" class="com.example.transformers.step.LoadTransformData"
    factory-method="reader" scope="step">
    <constructor-arg value="#{stepExecutionContext[outputFile]}" />
</bean>

<bean id="processTransformData" class="com.example.transformers.step.ProcessTransformData"
    scope="step">
    <property name="threadName" value="#{stepExecutionContext[threadName]}" />
    <property name="sourceFileName" value="#{jobParameters[filename]}" />       
</bean>

<bean id="notifyToJMS" class="com.example.transformers.step.NotifyToJMS"
    scope="step">
    <property name="fileName" value="#{stepExecutionContext[outputFile]}" />
</bean>

<bean id="outputFileDeletingTasklet"
    class="com.example.transformers.step.OutputFileDeletingTasklet"
    scope="step">
    <property name="directory" value="file:${output.directory}" />
</bean>

<bean class="org.springframework.batch.core.scope.StepScope" />

<bean id="jobRepository"
    class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
    <property name="transactionManager" ref="transactionManager" />
</bean>

<bean id="jobLauncher"
    class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
    <property name="jobRepository" ref="jobRepository" />
</bean>

<bean id="transactionManager"
    class="org.springframework.batch.support.transaction.ResourcelessTransactionManager" />

Custom multi-resource partitioner:

public Map<String, ExecutionContext> partition(int gridSize) {

    int index = 0;
    File directory = new File(outputPath);
    File[] fList = directory.listFiles();
    Map<String, ExecutionContext> result = new HashMap<>(gridSize);

    // listFiles() returns null if outputPath is not a readable directory
    if (fList == null) {
        return result;
    }

    // One partition per file: gridSize only sizes the map, it does not cap the partition count
    for (File file : fList) {
        if (file.isFile()) {
            ExecutionContext exContext = new ExecutionContext();
            logger.info(loggerClassName + " Starting : Thread [" + index + "] for file : " + file.getName());
            exContext.put(constants.THREAD_NAME, "Thread" + index);
            exContext.put(constants.OUTPUT_FILE, outputPath + file.getName());
            exContext.put(constants.OUTPUT_FILE_NAME, file.getName());
            result.put(constants.PARTITION + index, exContext);
            index++;
        }
    }

    return result;
}

Thanks for your replies.

Answer 1

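First, read this answer of mine to understand that Spring Batch does not perform well once the number of partitions goes beyond 100: the Spring Batch API itself starts taking too much time to prepare data in the meta tables. It is hard to understand why, but that is the way it is.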

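Second, splitting the large file into smaller files is correct, and it is the way to approach this problem. During this preprocessing, you may want to assign an identifier to each file name so that the files can easily be grouped later.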
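A minimal sketch of that preprocessing split, in plain Java rather than Spring Batch. The class name `FileSplitter`, the `chunk-%06d.txt` naming scheme, and the one-record-per-line assumption are all hypothetical; the zero-padded sequence number is just one way to give each file an identifier that is easy to group on later.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class FileSplitter {

    // Splits `source` into files of at most `linesPerChunk` lines each.
    // Assumes one record per line; the chunk naming scheme is hypothetical.
    static List<Path> split(Path source, Path outputDir, int linesPerChunk) throws IOException {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        Files.createDirectories(outputDir);
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            BufferedWriter writer = null;
            try {
                String line;
                int linesInChunk = 0;
                while ((line = reader.readLine()) != null) {
                    // Start a new chunk file for the first line and whenever the current one is full
                    if (writer == null || linesInChunk == linesPerChunk) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path chunk = outputDir.resolve(String.format("chunk-%06d.txt", chunks.size()));
                        chunks.add(chunk);
                        writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                        linesInChunk = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    linesInChunk++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return chunks;
    }
}
```

The source is streamed line by line, so memory use stays flat regardless of how large the input file is.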

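Where you go wrong is in creating as many partitions as there are files. If you have 10k files, and the Spring Batch API takes forever to create metadata for 1000 partitions, you can imagine how it behaves with 10k partitions.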

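What you need to do is fix the number of partitions in your job, so that one partition represents a group of files rather than a single file. How you implement that grouping is up to you. Say 50 partitions: you would divide your 10K files into 50 groups, with 200 files per partition.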

In your code, you use gridSize only to initialize the map; use it to fix the number of partitions instead.
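One possible way to do that grouping, sketched in plain Java so it runs without Spring Batch on the classpath. `FileGrouper`, the round-robin assignment, and the `partition` key prefix are assumptions for illustration; inside a real `partition(int gridSize)` method each list would be stored in that partition's `ExecutionContext` (for example as a delimited string) instead of being returned directly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FileGrouper {

    // Distributes file names round-robin over exactly `gridSize` groups,
    // so the number of partitions no longer depends on the number of files.
    static Map<String, List<String>> group(List<String> fileNames, int gridSize) {
        if (gridSize <= 0) {
            throw new IllegalArgumentException("gridSize must be positive");
        }
        Map<String, List<String>> partitions = new LinkedHashMap<>();
        for (int i = 0; i < gridSize; i++) {
            partitions.put("partition" + i, new ArrayList<>());
        }
        for (int i = 0; i < fileNames.size(); i++) {
            partitions.get("partition" + (i % gridSize)).add(fileNames.get(i));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            files.add(String.format("chunk-%06d.txt", i));
        }
        Map<String, List<String>> partitions = group(files, 50);
        // prints "50 partitions, 200 files in the first"
        System.out.println(partitions.size() + " partitions, "
                + partitions.get("partition0").size() + " files in the first");
    }
}
```

If there are fewer files than `gridSize`, some groups stay empty in this sketch; a real partitioner would probably skip those rather than create empty partitions.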

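Spring Batch then lets you choose how many partitions to start in parallel (your point 5); read step 3 of this other answer of mine. You can use an async task executor or a thread pool. The degree of parallelism depends on your server capacity.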

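This way, each of your threads handles a bunch of files rather than a single one. Of the total partitions, only a few are active at any one time, and the rest remain in a not-started state.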
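The effect of a bounded pool can be illustrated with plain `java.util.concurrent`, independent of Spring Batch. This is only a sketch of the idea: `BoundedPartitionRunner` is a hypothetical name, the `Thread.sleep` call stands in for processing one partition's files, and in a real job the pool would be the task executor attached to the partitioned step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedPartitionRunner {

    // Submits `partitionCount` tasks to a pool of `threads` workers and
    // returns the highest number of tasks observed running at the same time.
    static int run(int partitionCount, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int p = 0; p < partitionCount; p++) {
                futures.add(pool.submit(() -> {
                    int now = active.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5); // placeholder for processing this partition's files
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        active.decrementAndGet();
                    }
                }));
            }
            // Wait for every partition; queued ones start as workers become free
            for (Future<?> future : futures) {
                future.get();
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        int peak = run(50, 10);
        System.out.println("50 partitions processed, peak concurrency " + peak);
    }
}
```

With 50 partitions and 10 threads, the peak never exceeds 10: the other partitions simply wait in the queue until a worker is free.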

Answer 2

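I solved this on a batch basis. I fixed the partition limit at 100, with each partition responsible for completing multiple files. 1) Added multiple files to each partition. 2) Implemented a multi-resource item reader to read the multiple files and delegate to an item reader.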
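The reading side of that approach might look roughly like the following plain-Java analogue. It is not the Spring Batch `MultiResourceItemReader` itself: `MultiFileLineReader` is a hypothetical class that walks through a partition's files one after another and hands out one line at a time, returning `null` when everything has been read (the same end-of-input convention as `ItemReader.read()`).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;

class MultiFileLineReader implements AutoCloseable {

    private final Iterator<Path> files;
    private BufferedReader current; // reader for the file currently being consumed

    MultiFileLineReader(List<Path> files) {
        this.files = files.iterator();
    }

    // Returns the next line across all files, or null once every file is exhausted.
    String read() throws IOException {
        while (true) {
            if (current == null) {
                if (!files.hasNext()) {
                    return null;
                }
                current = Files.newBufferedReader(files.next(), StandardCharsets.UTF_8);
            }
            String line = current.readLine();
            if (line != null) {
                return line;
            }
            // Current file is finished: close it and move on to the next one
            current.close();
            current = null;
        }
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }
}
```

Only one file is open at a time, so a partition holding a couple of hundred files does not need a couple of hundred open file handles.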

Thanks to Sabir for the suggestion!

Spring Batch for large files (1 to 5 billion records of flat-file data)