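I am writing an ETL process in which I retrieve a large number of entities from a Spring Data Repository. I then use a parallel stream to map these entities to different entities. I can either use a consumer to store the new entities one by one in another repository, or collect them into a List and store that in a single bulk operation. The first is expensive, while the latter may exceed the available memory.
Is there a good way to collect a certain number of elements from the stream (as with limit), consume that chunk, and keep processing in parallel until all elements have been handled?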
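Answer 0 (score: 21)
My approach to bulk operations with chunking is to use a partitioning spliterator wrapper, plus another wrapper that overrides the default splitting policy (an arithmetic progression of batch sizes in increments of 1024) with simple fixed-batch splitting. Use it like this: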
Stream<OriginalType> existingStream = ...;
Stream<List<OriginalType>> partitioned = partition(existingStream, 100, 1);
partitioned.forEach(chunk -> ... process the chunk ...);
Here is the full code:
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
public class PartitioningSpliterator<E> extends AbstractSpliterator<List<E>>
{
private final Spliterator<E> spliterator;
private final int partitionSize;
public PartitioningSpliterator(Spliterator<E> toWrap, int partitionSize) {
super(toWrap.estimateSize(), toWrap.characteristics() | Spliterator.NONNULL);
if (partitionSize <= 0) throw new IllegalArgumentException(
"Partition size must be positive, but was " + partitionSize);
this.spliterator = toWrap;
this.partitionSize = partitionSize;
}
public static <E> Stream<List<E>> partition(Stream<E> in, int size) {
return StreamSupport.stream(new PartitioningSpliterator<>(in.spliterator(), size), false);
}
public static <E> Stream<List<E>> partition(Stream<E> in, int size, int batchSize) {
return StreamSupport.stream(
new FixedBatchSpliterator<>(new PartitioningSpliterator<>(in.spliterator(), size), batchSize), false);
}
@Override public boolean tryAdvance(Consumer<? super List<E>> action) {
final ArrayList<E> partition = new ArrayList<>(partitionSize);
while (spliterator.tryAdvance(partition::add)
&& partition.size() < partitionSize);
if (partition.isEmpty()) return false;
action.accept(partition);
return true;
}
@Override public long estimateSize() {
final long est = spliterator.estimateSize();
return est == Long.MAX_VALUE? est
: est / partitionSize + (est % partitionSize > 0? 1 : 0);
}
}
import static java.util.Spliterators.spliterator;
import java.util.Comparator;
import java.util.Spliterator;
import java.util.function.Consumer;
public abstract class FixedBatchSpliteratorBase<T> implements Spliterator<T> {
private final int batchSize;
private final int characteristics;
private long est;
public FixedBatchSpliteratorBase(int characteristics, int batchSize, long est) {
characteristics |= ORDERED;
if ((characteristics & SIZED) != 0) characteristics |= SUBSIZED;
this.characteristics = characteristics;
this.batchSize = batchSize;
this.est = est;
}
public FixedBatchSpliteratorBase(int characteristics, int batchSize) {
this(characteristics, batchSize, Long.MAX_VALUE);
}
public FixedBatchSpliteratorBase(int characteristics) {
this(characteristics, 64, Long.MAX_VALUE);
}
@Override public Spliterator<T> trySplit() {
final HoldingConsumer<T> holder = new HoldingConsumer<>();
if (!tryAdvance(holder)) return null;
final Object[] a = new Object[batchSize];
int j = 0;
do a[j] = holder.value; while (++j < batchSize && tryAdvance(holder));
if (est != Long.MAX_VALUE) est -= j;
return spliterator(a, 0, j, characteristics());
}
@Override public Comparator<? super T> getComparator() {
if (hasCharacteristics(SORTED)) return null;
throw new IllegalStateException();
}
@Override public long estimateSize() { return est; }
@Override public int characteristics() { return characteristics; }
static final class HoldingConsumer<T> implements Consumer<T> {
Object value;
@Override public void accept(T value) { this.value = value; }
}
}
import static java.util.stream.StreamSupport.stream;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;
public class FixedBatchSpliterator<T> extends FixedBatchSpliteratorBase<T> {
private final Spliterator<T> spliterator;
public FixedBatchSpliterator(Spliterator<T> toWrap, int batchSize, long est) {
super(toWrap.characteristics(), batchSize, est);
this.spliterator = toWrap;
}
public FixedBatchSpliterator(Spliterator<T> toWrap, int batchSize) {
this(toWrap, batchSize, toWrap.estimateSize());
}
public FixedBatchSpliterator(Spliterator<T> toWrap) {
this(toWrap, 64, toWrap.estimateSize());
}
public static <T> Stream<T> withBatchSize(Stream<T> in, int batchSize) {
return stream(new FixedBatchSpliterator<>(in.spliterator(), batchSize), true);
}
public static <T> FixedBatchSpliterator<T> batchedSpliterator(Spliterator<T> toWrap, int batchSize) {
return new FixedBatchSpliterator<>(toWrap, batchSize);
}
@Override public boolean tryAdvance(Consumer<? super T> action) {
return spliterator.tryAdvance(action);
}
@Override public void forEachRemaining(Consumer<? super T> action) {
spliterator.forEachRemaining(action);
}
}
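Answer 1 (score: 4)
You can write your own Collector that accumulates the entities and then performs the bulk update.
The Collector.accumulator() method can add entities to an internal temporary cache until the cache grows too large. Once the cache is large enough, you can perform a bulk store into the other repository.
Collector.combiner() needs to merge the caches of two threads' collectors into a single cache (and possibly store it if the merged cache has grown too large).
Finally, the Collector.finisher() method is called when the stream is done, so store whatever is left in the cache there as well.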
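The collector described above could be sketched roughly as follows. This is only an illustration under assumptions: `batching`, `BatchingCollectorSketch`, and the in-memory `fakeRepositorySave` consumer are hypothetical names that stand in for the real repository call, and the batch processor is assumed to be safe to call from several threads at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.stream.Collector;
import java.util.stream.IntStream;

public class BatchingCollectorSketch {

    /**
     * Hypothetical helper (not part of the JDK): buffers elements and hands every
     * full buffer to batchProcessor. Each buffer always holds fewer than batchSize
     * elements between operations, so no batch exceeds batchSize.
     */
    static <T> Collector<T, List<T>, Void> batching(int batchSize, Consumer<List<T>> batchProcessor) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        return Collector.of(
            () -> new ArrayList<T>(batchSize),           // supplier: one buffer per parallel segment
            (buffer, element) -> {                       // accumulator: flush when the buffer is full
                buffer.add(element);
                if (buffer.size() >= batchSize) {
                    batchProcessor.accept(new ArrayList<>(buffer));
                    buffer.clear();
                }
            },
            (left, right) -> {                           // combiner: merge the leftovers of two segments
                left.addAll(right);
                if (left.size() >= batchSize) {
                    List<T> head = left.subList(0, batchSize);
                    batchProcessor.accept(new ArrayList<>(head));
                    head.clear();                        // removes the flushed elements from left
                }
                return left;
            },
            buffer -> {                                  // finisher: flush whatever is still buffered
                if (!buffer.isEmpty()) batchProcessor.accept(new ArrayList<>(buffer));
                return null;
            });
    }

    public static void main(String[] args) {
        AtomicInteger stored = new AtomicInteger();
        // Stand-in for something like repository.save(batch); must be thread-safe.
        Consumer<List<Integer>> fakeRepositorySave = batch -> stored.addAndGet(batch.size());

        IntStream.range(0, 1050).parallel().boxed()
                 .map(i -> i * 2)
                 .collect(batching(100, fakeRepositorySave));

        System.out.println("stored=" + stored.get()); // prints "stored=1050"
    }
}
```

Batches are flushed from whichever worker thread fills a buffer, so batches do not arrive in encounter order, and many of them may be smaller than the batch size because each parallel segment flushes its own leftovers through the combiner or finisher.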
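Since you are already using a parallel stream and seem fine with performing several loads at the same time, I assume you have already taken care of thread safety.
Update
My comment about thread safety and parallel streams refers to the actual save/store into the repository, not to concurrency in the temporary collection.
Each collector should (I think) run in its own thread. A parallel stream should create multiple collector instances by calling supplier() multiple times. You can therefore treat a collector instance as single-threaded, and it should work fine.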
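For example, the Javadoc for java.util.IntSummaryStatistics says:
This implementation is not thread safe. However, it is safe to use Collectors.toIntStatistics() on a parallel stream, because the parallel implementation of Stream.collect() provides the necessary partitioning, isolation, and merging of results for safe and efficient parallel execution.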
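The claim that a parallel stream calls supplier() several times can be checked with a small experiment. The sketch below uses hypothetical names (`SupplierPerSegmentDemo`, `countingToList`, `collectInParallel`) and a plain, non-thread-safe ArrayList as the result container. The number of containers created depends on the machine and on how the stream is split, so no particular count is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collector;
import java.util.stream.IntStream;

public class SupplierPerSegmentDemo {

    // Counts how many result containers the stream asks the supplier for.
    static final AtomicInteger containersCreated = new AtomicInteger();

    // A list collector whose supplier records every invocation.
    static Collector<Integer, List<Integer>, List<Integer>> countingToList() {
        return Collector.of(
            () -> {
                containersCreated.incrementAndGet();
                return new ArrayList<Integer>();         // deliberately not thread-safe
            },
            List::add,                                   // each container is filled by one task at a time
            (left, right) -> { left.addAll(right); return left; });
    }

    static List<Integer> collectInParallel(int n) {
        return IntStream.range(0, n).parallel().boxed().collect(countingToList());
    }

    public static void main(String[] args) {
        List<Integer> result = collectInParallel(10_000);
        System.out.println("elements collected: " + result.size());
        System.out.println("containers created: " + containersCreated.get());
    }
}
```

All elements arrive in the final list even though ArrayList is not thread-safe, because each container is filled in isolation and the containers are only merged through the combiner.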
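Answer 2 (score: 1)
You can do this elegantly with a custom collector.
See my answer to a similar question here:
Custom batch processing collector
You can then use the collector above to process the stream in parallel batches and store the records back into the repository. Example usage: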
List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
int batchSize = 3;
Consumer<List<Integer>> batchProcessor = xs -> repository.save(xs);
input.parallelStream()
.map(i -> i + 1)
.collect(StreamUtils.batchCollector(batchSize, batchProcessor));
Answer 3 (score: 0)
@Test
public void streamTest(){
Stream<Integer> data = Stream.generate(() -> {
//Block on IO
return blockOnIO();
});
AtomicInteger countDown = new AtomicInteger(1000);
final ArrayList[] buffer = new ArrayList[]{new ArrayList<Integer>()};
Object syncO = new Object();
data.parallel().unordered().map(i -> i * 1000).forEach(i->{
System.out.println(String.format("FE %s %d",Thread.currentThread().getName(), buffer[0].size()));
int c;
ArrayList<Integer> export=null;
synchronized (syncO) {
c = countDown.addAndGet(-1);
buffer[0].add(i);
if (c == 0) {
export=buffer[0];
buffer[0] = new ArrayList<Integer>();
countDown.set(1000);
}
}
if(export !=null){
sendBatch(export);
}
});
//export any remaining
sendBatch(buffer[0]);
}
Integer blockOnIO(){
try {
Thread.sleep(50);
return Integer.valueOf((int)(Math.random()*1000)); // cast the product; (int)Math.random() is always 0
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
void sendBatch(ArrayList al){
assert al.size() <= 1000; // the final, leftover batch may be smaller than 1000
System.out.println(String.format("LOAD %s %d",Thread.currentThread().getName(), al.size()));
}
This may be somewhat old-fashioned, but it should achieve batching with a minimum of locking.
It will produce output like this:
FE ForkJoinPool.commonPool-worker-2 996
FE ForkJoinPool.commonPool-worker-5 996
FE ForkJoinPool.commonPool-worker-4 998
FE ForkJoinPool.commonPool-worker-3 999
LOAD ForkJoinPool.commonPool-worker-3 1000
FE ForkJoinPool.commonPool-worker-6 0
FE ForkJoinPool.commonPool-worker-1 2
FE ForkJoinPool.commonPool-worker-7 2
FE ForkJoinPool.commonPool-worker-2 4
Answer 4 (score: 0)
Here is a solution using my library, AbacusUtil:
stream.split(batchSize).parallel(threadNum).map(yourBatchProcessFunction);