Question

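I'm working on a use case that processes a large file with Spring Batch. I have to split the lines of the file into multiple buckets (DB shards) and write them to the DB, writing to each shard in parallel. Which shard a line goes to depends on the shard key on that line, which is part of the input file. I have thought of two options: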

Option 1:

Split the original file into n different files using Classifiers in Spring Batch.
Process each file using split and flows and write them to the DB.

Option 2:

Read each line and reorder the file such that all buckets are in order. While doing this I keep track of where each bucket starts and ends. 
Create a partitioner, provide the above info in its ExecutionContext, and write to the DB in parallel.
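To make option 2 concrete, here is a minimal sketch of the bookkeeping it implies, in plain Java with no Spring dependency. It assumes a hypothetical line layout of `shardKey,payload` and that the lines fit in memory, which may not hold for a truly large file. The class and method names are made up for illustration. Each resulting range (start inclusive, end exclusive) is the kind of value a partitioner could put into one ExecutionContext per bucket.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BucketRanges {

    // Assumption: the shard key is the text before the first comma.
    static String shardKeyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    // Returns a copy of the lines grouped by shard key.
    // List.sort is stable, so lines keep their relative order inside a bucket.
    static List<String> reorder(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparing(BucketRanges::shardKeyOf));
        return sorted;
    }

    // For already grouped lines, records where each bucket starts and ends:
    // shard key -> {start index (inclusive), end index (exclusive)}.
    static Map<String, int[]> ranges(List<String> sorted) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            String key = shardKeyOf(sorted.get(i));
            int[] range = ranges.get(key);
            if (range == null) {
                ranges.put(key, new int[] {i, i + 1});
            } else {
                range[1] = i + 1; // extend the bucket to include this line
            }
        }
        return ranges;
    }
}
```

Each worker would then read only the lines inside its own range and write them to its shard.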

Is there anything better I could do? Any clues as to which of the above options is better?

Thanks

Answer 1

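You don't need chunk processing (read + process + write) to split the file you have. You can create a custom tasklet that encapsulates the logic for splitting the file into new temporary files in a staging area, based on the shard key found on each line you read. This split is cheap to do in Java code, but depending on your environment you could even call a system command to sort and split.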
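The splitting logic such a tasklet would wrap could look roughly like the sketch below. It is plain Java so it can run on its own; the class name, the `shard-<key>.txt` file naming and the `shardKey,payload` line layout are all assumptions for illustration, not something given in the question. It also assumes the number of shards is small enough to keep one open writer per shard, and that shard keys are safe to use in file names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class ShardFileSplitter {

    // Assumption: the shard key is the text before the first comma.
    static String shardKeyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    // Streams the input once and writes each line to stagingDir/shard-<key>.txt.
    // Returns the staged files keyed by shard key.
    static Map<String, Path> split(Path input, Path stagingDir) throws IOException {
        Files.createDirectories(stagingDir);
        Map<String, BufferedWriter> writers = new HashMap<>();
        Map<String, Path> files = new TreeMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String key = shardKeyOf(line);
                BufferedWriter writer = writers.get(key);
                if (writer == null) {
                    // First line for this shard: create (or truncate) its file.
                    Path file = stagingDir.resolve("shard-" + key + ".txt");
                    writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
                    writers.put(key, writer);
                    files.put(key, file);
                }
                writer.write(line);
                writer.newLine();
            }
        } finally {
            // Flush and close every shard file, even if reading failed.
            for (BufferedWriter writer : writers.values()) {
                writer.close();
            }
        }
        return files;
    }
}
```

Inside a Spring Batch tasklet, the `execute` method would call something like `ShardFileSplitter.split(input, stagingDir)` once and then report that the step is finished.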

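Once the split is done, you can process the files in parallel with a MultiResourcePartitioner, as described in this example. If your large file is a monthly/weekly incremental file from a third party, also consider developing a step/tasklet that cleans up all the temporary files in the staging directory before processing.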
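As a rough illustration of those two pieces, the sketch below shows a cleanup helper and a plain-Java stand-in for the partitioning idea: one partition per staged file, each carrying the file's location. This is not Spring's implementation. As far as I recall, MultiResourcePartitioner builds a similar map of ExecutionContext objects and stores each resource URL under the key `fileName` by default. The class name and the `shard-*.txt` naming convention are assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StagedShardFiles {

    // Assumption: staged files follow this hypothetical naming convention.
    static final String GLOB = "shard-*.txt";

    // Lists the staged files in sorted order; an absent directory yields an empty list.
    static List<Path> list(Path stagingDir) throws IOException {
        List<Path> files = new ArrayList<>();
        if (Files.isDirectory(stagingDir)) {
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(stagingDir, GLOB)) {
                for (Path file : stream) {
                    files.add(file);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    // Cleanup step: deletes leftover staged files from a previous run.
    // Returns how many files were removed.
    static int clean(Path stagingDir) throws IOException {
        List<Path> stale = list(stagingDir);
        for (Path file : stale) {
            Files.delete(file);
        }
        return stale.size();
    }

    // Stand-in for a partitioner: one entry per staged file,
    // named partition0, partition1, ... with the file URI under "fileName".
    static Map<String, Map<String, String>> partitions(Path stagingDir) throws IOException {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        int index = 0;
        for (Path file : list(stagingDir)) {
            Map<String, String> context = new LinkedHashMap<>();
            context.put("fileName", file.toUri().toString());
            result.put("partition" + index++, context);
        }
        return result;
    }
}
```

In a real job, each worker step would read the file named in its own context and write to the matching shard, so the number of staged files determines the degree of parallelism.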

Spring Batch: Partitioning a file based on custom logic and writing to the DB in parallel
