As a newcomer to Hazelcast Jet, I am trying to build a setup where single items from an unbounded source (i.e., a Map Journal of user requests) are MapReduced against a (possibly changing) huge Map of reference items.
Specifically, for this example, I want to determine the vectors (read: float[]) with the smallest Euclidean distance in a map of vectors (the reference), given a user-defined input vector (the query).
Implemented naively on a single machine, this would iterate over the Map entries of the reference and determine the Euclidean distance to the query for each of them, while keeping the k smallest matches. The input comes from a user request (HTTP POST, button click, etc.), and the result set should be available as soon as the computation is done.
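To make that concrete, a naive single-node version (just a sketch, using plain java.util collections and the same MAX_RESULTS constant as in the code below) would be something like:

// Naive single-node scan: check every reference vector and keep the
// MAX_RESULTS entries with the smallest Euclidean distance to the query.
static List<Map.Entry<Long, Float>> kSmallest(Map<Long, float[]> reference, float[] query) {
    // max-heap on distance, so the worst of the kept matches sits on top
    PriorityQueue<Map.Entry<Long, Float>> best =
            new PriorityQueue<>(Comparator.comparing(Map.Entry<Long, Float>::getValue).reversed());
    for (Map.Entry<Long, float[]> e : reference.entrySet()) {
        float distance = 0f;
        for (int i = 0; i < query.length; ++i) {
            float d = query[i] - e.getValue()[i];
            distance += d * d;
        }
        best.add(new AbstractMap.SimpleEntry<>(e.getKey(), (float) Math.sqrt(distance)));
        if (best.size() > MAX_RESULTS) {
            best.remove(); // evict the currently worst match
        }
    }
    List<Map.Entry<Long, Float>> result = new ArrayList<>(best);
    result.sort(Comparator.comparing(Map.Entry::getValue));
    return result;
}

The goal is to distribute exactly this scan-and-keep-k pattern across the cluster.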
My latest approach is to broadcast the request to the mapping jobs (.distributed().broadcast()), have each mapper work on its local share of the reference map (.localKeySet()), and send the partial results to the reducer partitioned by request ID (.partitioned(item -> item.requestId)).

Conceptually, each query is a batch of size 1, and I am really processing batches. However, I am having a lot of trouble letting the mappers and reducers know when a batch is done, so that the collectors know when they are done (so that they can emit the final result).
I tried using watermarks with real and fake timestamps (obtained automatically via an AtomicLong instance) and emitting from the tryProcessWm function, but this seems to be a very fragile solution, as some events get dropped. I also need to make sure that no two requests are interleaved (i.e., by partitioning on the request ID), but at the same time have the mapper run on all nodes...
How would I attack this task?
Edit #1:
Right now, my mapper looks like this:
private static class EuclideanDistanceMapP extends AbstractProcessor {

    private IMap<Long, float[]> referenceVectors;

    final ScoreComparator comparator = new ScoreComparator();

    @Override
    protected void init(@Nonnull Context context) throws Exception {
        this.referenceVectors = context.jetInstance().getMap(REFERENCE_VECTOR_MAP_NAME);
        super.init(context);
    }

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        final Tuple3<Long, Long, float[]> query = (Tuple3<Long, Long, float[]>) item;
        final long requestId = query.f0();
        final long timestamp = query.f1();
        final float[] queryVector = query.f2();

        final TreeSet<Tuple2<Long, Float>> buffer = new TreeSet<>(comparator);
        for (Long vectorKey : referenceVectors.localKeySet()) {
            float[] referenceVector = referenceVectors.get(vectorKey);
            float distance = 0.0f;
            for (int i = 0; i < queryVector.length; ++i) {
                distance += (queryVector[i] - referenceVector[i]) * (queryVector[i] - referenceVector[i]);
            }
            final Tuple2<Long, Float> score = Tuple2.tuple2(vectorKey, (float) Math.sqrt(distance));
            if (buffer.size() < MAX_RESULTS) {
                buffer.add(score);
                continue;
            }
            // If the value is larger than the largest entry, discard it.
            if (comparator.compare(score, buffer.last()) >= 0) {
                continue;
            }
            // Otherwise we remove the largest entry after adding the new one.
            buffer.add(score);
            buffer.pollLast();
        }
        return tryEmit(Tuple3.tuple3(requestId, timestamp, buffer.toArray()));
    }

    private static class ScoreComparator implements Comparator<Tuple2<Long, Float>> {
        @Override
        public int compare(Tuple2<Long, Float> a, Tuple2<Long, Float> b) {
            return Float.compare(a.f1(), b.f1());
        }
    }
}
The reducer is essentially a duplicate of this (minus the vector computation, of course).
Edit #2:
This is the DAG setup. It currently fails when there are multiple concurrent requests; most of the items get dropped because of the watermarks.
DAG dag = new DAG();

Vertex sourceStream = dag.newVertex("source",
        SourceProcessors.<Long, float[], Tuple2<Long, float[]>>streamMapP(QUERY_VECTOR_MAP_NAME,
                e -> e.getType() == EntryEventType.ADDED || e.getType() == EntryEventType.UPDATED,
                e -> Tuple2.tuple2(e.getKey(), e.getNewValue()), true));

// simple map() using an AtomicLong to create the timestamp
Vertex addTimestamps = dag.newVertex("addTimestamps", AddTimestampMapP::new);

// the class shown above
Vertex map = dag.newVertex("map", EuclideanDistanceMapP::new);

Vertex insertWatermarks = dag.newVertex("insertWatermarks",
        insertWatermarksP((Tuple3<Long, Long, float[]> t) -> t.f1(), withFixedLag(0), emitByMinStep(1)));

Vertex combine = dag.newVertex("combine", CombineP::new);

// simple map() that drops the timestamp
Vertex removeTimestamps = dag.newVertex("removeTimestamps", RemoveTimestampMapP::new);

// using a list here for testing
Vertex sink = dag.newVertex("sink", SinkProcessors.writeListP(SINK_NAME));

dag.edge(between(sourceStream, addTimestamps))
   .edge(between(addTimestamps, map.localParallelism(1))
           .broadcast()
           .distributed())
   .edge(between(map, insertWatermarks).isolated())
   .edge(between(insertWatermarks, combine.localParallelism(1))
           .distributed()
           .partitioned((Tuple2<Long, Tuple2<Long, Float>[]> item) -> item.f0()))
   .edge(between(combine, removeTimestamps)
           .partitioned((Tuple3<Long, Long, Tuple2<Long, Float>[]> item) -> item.f0()))
   .edge(between(removeTimestamps, sink.localParallelism(1)));
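AddTimestampMapP and RemoveTimestampMapP are trivial; roughly sketched here (not the exact classes from my project), they wrap and unwrap the AtomicLong-based timestamp:

// Simplified sketch of the two helper processors referenced in the DAG above.
// AddTimestampMapP attaches a monotonically increasing counter value as the
// "timestamp"; RemoveTimestampMapP strips it again before the sink.
private static class AddTimestampMapP extends AbstractProcessor {
    private static final AtomicLong TIMESTAMP = new AtomicLong();

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        final Tuple2<Long, float[]> query = (Tuple2<Long, float[]>) item;
        return tryEmit(Tuple3.tuple3(query.f0(), TIMESTAMP.incrementAndGet(), query.f1()));
    }
}

private static class RemoveTimestampMapP extends AbstractProcessor {
    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        final Tuple3<Long, Long, Object> result = (Tuple3<Long, Long, Object>) item;
        return tryEmit(Tuple2.tuple2(result.f0(), result.f2()));
    }
}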
Edit #3:
This is my current combiner implementation. I assumed that all items would arrive ordered according to the watermarks, or more generally, that a given combiner instance would only collect items of the same request. This does not seem to be the case...
private static class CombineP extends AbstractProcessor {

    private final ScoreComparator comparator = new ScoreComparator();
    private final TreeSet<Tuple2<Long, Float>> buffer = new TreeSet<>(comparator);
    private Long requestId;
    private Long timestamp = -1L;

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        final Tuple3<Long, Long, Tuple2<Long, Float>[]> itemTuple = (Tuple3<Long, Long, Tuple2<Long, Float>[]>) item;
        requestId = itemTuple.f0();
        final long currentTimestamp = itemTuple.f1();
        if (currentTimestamp > timestamp) {
            buffer.clear();
        }
        timestamp = currentTimestamp;

        final Object[] scores = itemTuple.f2();
        for (Object scoreObj : scores) {
            final Tuple2<Long, Float> score = (Tuple2<Long, Float>) scoreObj;
            if (buffer.size() < MAX_RESULTS) {
                buffer.add(score);
                continue;
            }
            // If the value is larger than the largest entry, discard it.
            if (comparator.compare(score, buffer.last()) >= 0) {
                continue;
            }
            // Otherwise we remove the largest entry after adding the new one.
            buffer.add(score);
            buffer.pollLast();
        }
        return true;
    }

    @Override
    protected boolean tryProcessWm(int ordinal, @Nonnull Watermark wm) {
        // return super.tryProcessWm(ordinal, wm);
        return tryEmit(Tuple3.tuple3(requestId, timestamp, buffer.toArray())) && super.tryProcessWm(ordinal, wm);
    }

    private static class ScoreComparator implements Comparator<Tuple2<Long, Float>> {
        @Override
        public int compare(Tuple2<Long, Float> a, Tuple2<Long, Float> b) {
            return Float.compare(a.f1(), b.f1());
        }
    }
}
Answer 0 (score: 1)
You always have to keep in mind that items between two vertices can be reordered. When you have parallel requests, their intermediate results can interleave in CombineP.
In CombineP, you can rely on the fact that the number of intermediate results is equal to the number of members in the cluster. Calculate the number of participating members in init() as globalParallelism / localParallelism (for example, with 3 members and a local parallelism of 4, the total parallelism is 12, giving an upstream member count of 3). When you have received that many intermediate results, you can emit the final result.
Another trick might be to run multiple requests in parallel on each member. You can achieve this by using two edges: 1. a broadcast + distributed edge to a processor with parallelism 1, 2. a unicast edge to processors with parallelism N. (In the DAG below, that is the role of the identity vertex.)
Also note that localKeySet() is not suited for huge maps: the size of the query result is limited.
Here is the code to do the above. It works with Jet 0.5:
DAG:
DAG dag = new DAG();

Vertex sourceStream = dag.newVertex("source",
        streamMapP(QUERY_VECTOR_MAP_NAME,
                e -> e.getType() == EntryEventType.ADDED || e.getType() == EntryEventType.UPDATED,
                e -> entry(e.getKey(), e.getNewValue()), true));

Vertex identity = dag.newVertex("identity", mapP(identity()))
        .localParallelism(1);

Vertex map = dag.newVertex("map", peekOutputP(EuclideanDistanceMapP::new));

Vertex combine = dag.newVertex("combine", peekOutputP(new CombineMetaSupplier()));

Vertex sink = dag.newVertex("sink", writeListP(SINK_NAME));

dag.edge(between(sourceStream, identity)
           .broadcast()
           .distributed())
   .edge(between(identity, map))
   .edge(between(map, combine)
           .distributed()
           .partitioned((Entry item) -> item.getKey()))
   .edge(between(combine, sink));
The EuclideanDistanceMapP class:
private static class EuclideanDistanceMapP extends AbstractProcessor {

    private IMap<Long, float[]> referenceVectors;
    final ScoreComparator comparator = new ScoreComparator();
    private Object pendingItem;

    @Override
    protected void init(@Nonnull Context context) throws Exception {
        this.referenceVectors = context.jetInstance().getMap(REFERENCE_VECTOR_MAP_NAME);
        super.init(context);
    }

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        if (pendingItem == null) {
            final Entry<Long, float[]> query = (Entry<Long, float[]>) item;
            final long requestId = query.getKey();
            final float[] queryVector = query.getValue();
            final PriorityQueue<Entry<Long, Float>> buffer = new PriorityQueue<>(comparator.reversed());
            for (Long vectorKey : referenceVectors.localKeySet()) {
                float[] referenceVector = referenceVectors.get(vectorKey);
                float distance = 0.0f;
                for (int i = 0; i < queryVector.length; ++i) {
                    distance += (queryVector[i] - referenceVector[i]) * (queryVector[i] - referenceVector[i]);
                }
                final Entry<Long, Float> score = entry(vectorKey, (float) Math.sqrt(distance));
                if (buffer.size() < MAX_RESULTS || comparator.compare(score, buffer.peek()) < 0) {
                    if (buffer.size() == MAX_RESULTS)
                        buffer.remove();
                    buffer.add(score);
                }
            }
            pendingItem = entry(requestId, buffer.toArray(new Entry[0]));
        }
        if (tryEmit(pendingItem)) {
            pendingItem = null;
            return true;
        }
        return false;
    }
}
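ScoreComparator is not repeated here; it is presumably the Entry-based analogue of the comparator from your question, roughly:

// Presumed Entry-based analogue of the question's ScoreComparator:
// orders score entries by ascending distance (the entry value).
private static class ScoreComparator implements Comparator<Entry<Long, Float>> {
    @Override
    public int compare(Entry<Long, Float> a, Entry<Long, Float> b) {
        return Float.compare(a.getValue(), b.getValue());
    }
}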
The CombineP class:
private static class CombineP extends AbstractProcessor {

    private final ScoreComparator comparator = new ScoreComparator();
    private final Map<Long, PriorityQueue<Entry<Long, Float>>> buffer = new HashMap<>();
    private final Map<Long, Integer> accumulatedCount = new HashMap<>();
    private final int upstreamMemberCount;
    private Entry<Long, Entry<Long, Float>[]> pendingItem;

    private CombineP(int upstreamMemberCount) {
        this.upstreamMemberCount = upstreamMemberCount;
    }

    @Override
    protected boolean tryProcess0(@Nonnull Object item) {
        if (pendingItem == null) {
            final Entry<Long, Entry<Long, Float>[]> localValue = (Entry<Long, Entry<Long, Float>[]>) item;
            long requestId = localValue.getKey();
            PriorityQueue<Entry<Long, Float>> globalValue =
                    buffer.computeIfAbsent(requestId, key -> new PriorityQueue<>(comparator.reversed()));
            globalValue.addAll(asList(localValue.getValue()));
            while (globalValue.size() > MAX_RESULTS) {
                globalValue.remove();
            }
            int count = accumulatedCount.merge(requestId, 1, Integer::sum);
            if (count == upstreamMemberCount) {
                // we've received enough local values, let's emit and remove the accumulator
                pendingItem = entry(requestId, globalValue.toArray(new Entry[0]));
                Arrays.sort(pendingItem.getValue(), comparator);
                buffer.remove(requestId);
                accumulatedCount.remove(requestId);
            } else {
                return true;
            }
        }
        if (tryEmit(pendingItem)) {
            pendingItem = null;
            return true;
        }
        return false;
    }
}
You also need a custom meta-supplier for CombineP:
private static class CombineMetaSupplier implements ProcessorMetaSupplier {

    private int upstreamMemberCount;

    @Override
    public void init(@Nonnull Context context) {
        upstreamMemberCount = context.totalParallelism() / context.localParallelism();
    }

    @Nonnull
    @Override
    public Function<Address, ProcessorSupplier> get(@Nonnull List<Address> addresses) {
        return address -> ProcessorSupplier.of(() -> new CombineP(upstreamMemberCount));
    }
}
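For completeness, a rough driver for trying this out could look like the following (not part of the original answer; it assumes the query map has an event journal configured, which streamMapP requires, and the exact job-submission call may differ between Jet 0.5 and newer releases):

// Minimal test driver (sketch); "dag" is the DAG built above.
JetInstance jet = Jet.newJetInstance();

// populate the reference vectors
IMap<Long, float[]> reference = jet.getMap(REFERENCE_VECTOR_MAP_NAME);
reference.put(1L, new float[] {0f, 0f});
reference.put(2L, new float[] {1f, 1f});

// submit the streaming DAG; in Jet 0.5 you may additionally have to start
// the returned Job explicitly, in newer versions newJob() already submits it
jet.newJob(dag);

// issue a "request" by putting a query vector into the journaled map
jet.<Long, float[]>getMap(QUERY_VECTOR_MAP_NAME).put(42L, new float[] {0.1f, 0.2f});

// results arrive asynchronously in the sink IList
System.out.println(jet.getList(SINK_NAME));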