Parameters of reduceByKey in Spark

Date: 2017-06-13 11:06:58

Tags: java apache-spark

When coding in Java with Spark, I have been struggling with the arguments to reduceByKey. I understand what reduceByKey means and how it works, but the code below differs slightly from the basic Spark examples (e.g., the word-count example).

As you can see, reduceByKey here takes two arguments: new KruskalReducer(numPoints) and numSubGraphs. numSubGraphs is an integer value, and KruskalReducer is a Java class.

 mstToBeMergedResult = mstToBeMerged.mapToPair(new SetPartitionIdFunction(K)).reduceByKey(
                    new KruskalReducer(numPoints), numSubGraphs);

I don't understand why this integer variable is passed to reduceByKey. I tried to connect the two arguments to the concept behind reduceByKey but failed to get it.

I have attached the Java class for your reference.

 public static final class KruskalReducer implements Function2<Iterable<Edge>, Iterable<Edge>, Iterable<Edge>>{
        private static final long serialVersionUID = 1L;
        private transient UnionFind uf = null;
        private final int numPoints;

        public KruskalReducer(int numPoints) {
            this.numPoints = numPoints;
        }

        // merge two weight-sorted edge lists, Kruskal-style
        @Override
        public Iterable<Edge> call(Iterable<Edge> leftEdges, Iterable<Edge> rightEdges) throws Exception{
            uf = new UnionFind(numPoints);
            List<Edge> edges = Lists.newArrayList();
            Iterator<Edge> leftEdgesIterator = leftEdges.iterator();
            Iterator<Edge> rightEdgesIterator = rightEdges.iterator();
            Edge leftEdge = leftEdgesIterator.next();
            Edge rightEdge = rightEdgesIterator.next();
            Edge minEdge;
            boolean isLeft;
            Iterator<Edge> minEdgeIterator;
            final int numEdges = numPoints - 1;
            do {
                if (leftEdge.getWeight() < rightEdge.getWeight()) {
                    minEdgeIterator = leftEdgesIterator;
                    minEdge = leftEdge;
                    isLeft = true;
                } else {
                    minEdgeIterator = rightEdgesIterator;
                    minEdge = rightEdge;
                    isLeft = false;
                }
                if (uf.unify(minEdge.getLeft(), minEdge.getRight())) {
                    edges.add(minEdge);
                }
                minEdge = minEdgeIterator.hasNext() ? minEdgeIterator.next() : null;
                if (isLeft) {
                    leftEdge = minEdge;
                } else {
                    rightEdge = minEdge;
                }
            }while (minEdge != null && edges.size() < numEdges);
            minEdge = isLeft ? rightEdge : leftEdge;
            minEdgeIterator = isLeft ? rightEdgesIterator : leftEdgesIterator;

            while (edges.size() < numEdges && minEdgeIterator.hasNext()) {
                if (uf.unify(minEdge.getLeft(), minEdge.getRight())) {
                    edges.add(minEdge);
                }
                minEdge = minEdgeIterator.next();
            }
            return edges;
        }
    }
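To see what KruskalReducer's merge is doing in isolation, here is a compact, self-contained sketch of the same idea: walk two weight-sorted edge lists in merged order and keep an edge only if it joins two previously separate components. The Edge record and UnionFind class below are hypothetical minimal stand-ins for the types used in the question, not the actual project classes, and the drain step is simplified relative to the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class KruskalMergeDemo {
    // Hypothetical stand-in for the question's Edge type.
    record Edge(int left, int right, double weight) {}

    // Hypothetical stand-in for the question's UnionFind type.
    static class UnionFind {
        private final int[] parent;
        UnionFind(int n) { parent = new int[n]; for (int i = 0; i < n; i++) parent[i] = i; }
        int find(int x) { while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; } return x; }
        // Returns true if x and y were in different components (edge accepted).
        boolean unify(int x, int y) {
            int rx = find(x), ry = find(y);
            if (rx == ry) return false;
            parent[rx] = ry;
            return true;
        }
    }

    // Same idea as KruskalReducer.call: merge two weight-sorted edge lists,
    // keeping an edge only if it connects two separate components, and stop
    // once numPoints - 1 edges (a spanning tree) have been collected.
    static List<Edge> merge(List<Edge> left, List<Edge> right, int numPoints) {
        UnionFind uf = new UnionFind(numPoints);
        List<Edge> result = new ArrayList<>();
        int i = 0, j = 0, numEdges = numPoints - 1;
        while (i < left.size() && j < right.size() && result.size() < numEdges) {
            Edge min = left.get(i).weight() < right.get(j).weight() ? left.get(i++) : right.get(j++);
            if (uf.unify(min.left(), min.right())) result.add(min);
        }
        // Drain whichever list still has edges.
        List<Edge> rest = i < left.size() ? left.subList(i, left.size()) : right.subList(j, right.size());
        for (Edge e : rest) {
            if (result.size() >= numEdges) break;
            if (uf.unify(e.left(), e.right())) result.add(e);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Edge> a = List.of(new Edge(0, 1, 1.0), new Edge(2, 3, 4.0));
        List<Edge> b = List.of(new Edge(1, 2, 2.0), new Edge(0, 3, 5.0));
        System.out.println(merge(a, b, 4)); // 3 edges forming the MST of 4 points
    }
}
```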

In addition, the full related code is shown below. (You can skip this if it is confusing.)

   JavaPairRDD<Integer, Iterable<Edge>> mstToBeMerged = partitions.combineByKey(new CreateCombiner(),
                    new Merger(), new KruskalReducer(numPoints));


JavaPairRDD<Integer, Iterable<Edge>> mstToBeMergedResult = null;
while (numSubGraphs > 1){
     numSubGraphs = (numSubGraphs + (K - 1)) / K;
     mstToBeMergedResult = mstToBeMerged.mapToPair(new SetPartitionIdFunction(K)).reduceByKey(
              new KruskalReducer(numPoints), numSubGraphs);
     mstToBeMerged = mstToBeMergedResult;
     displayResults(mstToBeMerged);
}
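For context, the loop above keeps merging until one subgraph remains: each round divides numSubGraphs by K with ceiling division. A plain-Java sketch of just that arithmetic (with illustrative values for numSubGraphs and K, which are assumptions, not the question's actual values):

```java
public class MergeRoundsDemo {
    // Counts how many merge rounds the question's while loop would run:
    // each round replaces numSubGraphs with ceil(numSubGraphs / K).
    static int mergeRounds(int numSubGraphs, int K) {
        int rounds = 0;
        while (numSubGraphs > 1) {
            numSubGraphs = (numSubGraphs + (K - 1)) / K; // ceiling division, as in the question
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        // e.g. 32 subgraphs merged 4-at-a-time: 32 -> 8 -> 2 -> 1
        System.out.println(mergeRounds(32, 4)); // prints 3
    }
}
```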


private static class CreateCombiner implements Function<Edge, Iterable<Edge>>{

        private static final long serialVersionUID = 1L;

        @Override
        public Iterable<Edge> call(Edge edge) throws Exception {
            List<Edge> edgeList = Lists.newArrayListWithCapacity(1);
            edgeList.add(edge);
            return edgeList;
        }
    }

    private static class Merger implements Function2<Iterable<Edge>, Edge, Iterable<Edge>>{

        private static final long serialVersionUID = 1L;

        @Override
        public Iterable<Edge> call(Iterable<Edge> list, Edge edge) throws Exception {
            List<Edge> mergeList = Lists.newArrayList(list);
            mergeList.add(edge);
            return mergeList;
        }
    }
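The CreateCombiner and Merger classes above implement combineByKey's per-partition semantics: the first value seen for a key goes through createCombiner, and each later value is folded in with mergeValue (KruskalReducer then merges combiners across partitions). A minimal plain-Java sketch of that contract, using a generic helper of my own devising rather than Spark's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

public class CombineByKeyDemo {
    // Plain-Java sketch of combineByKey within one partition: the first
    // value for a key goes through createCombiner, later values through
    // mergeValue. (Across partitions, Spark would additionally apply a
    // mergeCombiners function - KruskalReducer in the question.)
    static <K, V, C> Map<K, C> combineByKey(List<Map.Entry<K, V>> pairs,
                                            Function<V, C> createCombiner,
                                            BiFunction<C, V, C> mergeValue) {
        Map<K, C> out = new HashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            C c = out.get(p.getKey());
            out.put(p.getKey(), c == null ? createCombiner.apply(p.getValue())
                                          : mergeValue.apply(c, p.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        var result = combineByKey(
                List.of(Map.entry(1, "a"), Map.entry(1, "b"), Map.entry(2, "c")),
                v -> new ArrayList<>(List.of(v)),        // like CreateCombiner
                (list, v) -> { list.add(v); return list; }); // like Merger
        System.out.println(result); // {1=[a, b], 2=[c]}
    }
}
```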

1 answer:

Answer 0 (score: 1)


> I don't understand why an integer variable like this is used with reduceByKey. I tried to connect the two arguments to the concept behind reduceByKey but failed to get it.

If I'm reading the right overload:

def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V] =
  fromRDD(rdd.reduceByKey(func, numPartitions))

then the number you pass is the number of partitions in the resulting RDD. Because reduceByKey is a shuffle-boundary operation, the data gets repartitioned, and passing this number lets you control how many partitions come out of the shuffle.
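Concretely, with hash partitioning (Spark's default) each key lands in a partition determined by a non-negative modulo of its hashCode. The sketch below illustrates that rule in plain Java; it is a simplified stand-in, not Spark's actual HashPartitioner class, and the sample numSubGraphs value is only illustrative.

```java
public class HashPartitionerDemo {
    // Sketch of hash partitioning: a key goes to partition
    // hashCode mod numPartitions, adjusted to be non-negative.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    public static void main(String[] args) {
        int numSubGraphs = 4; // stands in for the second argument to reduceByKey
        for (int key = 0; key < 8; key++) {
            System.out.println("key " + key + " -> partition " + partitionFor(key, numSubGraphs));
        }
    }
}
```

So in the question's loop, shrinking numSubGraphs each round also shrinks the number of shuffle partitions, keeping one merged subgraph per partition.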