While coding Spark in Java, I have been struggling with the arguments to reduceByKey. I understand what reduceByKey means and how it works, but the code below differs slightly from the basic Spark examples (e.g., the word count example). As you can see, reduceByKey is called here with two arguments: new KruskalReducer(numPoints)
and numSubGraphs.
numSubGraphs
is an integer value, and KruskalReducer
is a Java class.
mstToBeMergedResult = mstToBeMerged.mapToPair(new SetPartitionIdFunction(K)).reduceByKey(
new KruskalReducer(numPoints), numSubGraphs);
I don't understand why an integer variable is used in reduceByKey. I tried to relate the two arguments to the concept of reduceByKey but could not work it out.
I have attached the Java class for your reference.
public static final class KruskalReducer implements Function2&lt;Iterable&lt;Edge&gt;, Iterable&lt;Edge&gt;, Iterable&lt;Edge&gt;&gt; {
    private static final long serialVersionUID = 1L;
    private transient UnionFind uf = null;
    private final int numPoints;

    public KruskalReducer(int numPoints) {
        this.numPoints = numPoints;
    }

    // merge sort
    @Override
    public Iterable&lt;Edge&gt; call(Iterable&lt;Edge&gt; leftEdges, Iterable&lt;Edge&gt; rightEdges) throws Exception {
        uf = new UnionFind(numPoints);
        List&lt;Edge&gt; edges = Lists.newArrayList();
        Iterator&lt;Edge&gt; leftEdgesIterator = leftEdges.iterator();
        Iterator&lt;Edge&gt; rightEdgesIterator = rightEdges.iterator();
        Edge leftEdge = leftEdgesIterator.next();
        Edge rightEdge = rightEdgesIterator.next();
        Edge minEdge;
        boolean isLeft;
        Iterator&lt;Edge&gt; minEdgeIterator;
        final int numEdges = numPoints - 1;
        do {
            if (leftEdge.getWeight() &lt; rightEdge.getWeight()) {
                minEdgeIterator = leftEdgesIterator;
                minEdge = leftEdge;
                isLeft = true;
            } else {
                minEdgeIterator = rightEdgesIterator;
                minEdge = rightEdge;
                isLeft = false;
            }
            if (uf.unify(minEdge.getLeft(), minEdge.getRight())) {
                edges.add(minEdge);
            }
            minEdge = minEdgeIterator.hasNext() ? minEdgeIterator.next() : null;
            if (isLeft) {
                leftEdge = minEdge;
            } else {
                rightEdge = minEdge;
            }
        } while (minEdge != null &amp;&amp; edges.size() &lt; numEdges);
        minEdge = isLeft ? rightEdge : leftEdge;
        minEdgeIterator = isLeft ? rightEdgesIterator : leftEdgesIterator;
        while (edges.size() &lt; numEdges &amp;&amp; minEdgeIterator.hasNext()) {
            if (uf.unify(minEdge.getLeft(), minEdge.getRight())) {
                edges.add(minEdge);
            }
            minEdge = minEdgeIterator.next();
        }
        return edges;
    }
}
In addition, the full related code is shown below. (You can skip this code if it is confusing.)
JavaPairRDD&lt;Integer, Iterable&lt;Edge&gt;&gt; mstToBeMerged = partitions.combineByKey(new CreateCombiner(),
        new Merger(), new KruskalReducer(numPoints));
JavaPairRDD&lt;Integer, Iterable&lt;Edge&gt;&gt; mstToBeMergedResult = null;
while (numSubGraphs &gt; 1) {
    numSubGraphs = (numSubGraphs + (K - 1)) / K;
    mstToBeMergedResult = mstToBeMerged.mapToPair(new SetPartitionIdFunction(K)).reduceByKey(
            new KruskalReducer(numPoints), numSubGraphs);
    mstToBeMerged = mstToBeMergedResult;
    displayResults(mstToBeMerged);
}

private static class CreateCombiner implements Function&lt;Edge, Iterable&lt;Edge&gt;&gt; {
    private static final long serialVersionUID = 1L;

    @Override
    public Iterable&lt;Edge&gt; call(Edge edge) throws Exception {
        List&lt;Edge&gt; edgeList = Lists.newArrayListWithCapacity(1);
        edgeList.add(edge);
        return edgeList;
    }
}

private static class Merger implements Function2&lt;Iterable&lt;Edge&gt;, Edge, Iterable&lt;Edge&gt;&gt; {
    private static final long serialVersionUID = 1L;

    @Override
    public Iterable&lt;Edge&gt; call(Iterable&lt;Edge&gt; list, Edge edge) throws Exception {
        List&lt;Edge&gt; mergeList = Lists.newArrayList(list);
        mergeList.add(edge);
        return mergeList;
    }
}
Answer (score: 1)
I don't understand why such an integer variable is used in reduceByKey. I tried to relate the two arguments to the concept of reduceByKey but could not work it out.
If I'm reading the correct overload:
def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V] =
fromRDD(rdd.reduceByKey(func, numPartitions))
then the number you pass is the number of partitions in the resulting RDD. Because reduceByKey
is a shuffle boundary operation, the data will be repartitioned, and passing this number lets you control how many partitions the result has.
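So in the question's code, numSubGraphs only controls parallelism; all of the merge logic lives in KruskalReducer. As a minimal sketch of the idea (plain JDK only, no Spark; the class and method names below are hypothetical, written just to illustrate the semantics), reduceByKey(func, numPartitions) amounts to hash-partitioning keys into numPartitions buckets and reducing the values per key:

```java
import java.util.*;
import java.util.function.BinaryOperator;

public class ReduceByKeySketch {
    // Simulates reduceByKey(func, numPartitions): keys are hash-partitioned
    // into numPartitions buckets, then values are reduced per key with func.
    static <K, V> List<Map<K, V>> reduceByKey(
            List<Map.Entry<K, V>> pairs, BinaryOperator<V> func, int numPartitions) {
        List<Map<K, V>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
        for (Map.Entry<K, V> pair : pairs) {
            // Same idea as Spark's default HashPartitioner:
            // partition = nonNegative(hash(key)) mod numPartitions
            int p = Math.floorMod(pair.getKey().hashCode(), numPartitions);
            partitions.get(p).merge(pair.getKey(), pair.getValue(), func);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Integer>> pairs = List.of(
                Map.entry(1, 10), Map.entry(2, 20), Map.entry(1, 5), Map.entry(3, 7));
        List<Map<Integer, Integer>> result = reduceByKey(pairs, Integer::sum, 2);
        System.out.println(result.size());        // 2 partitions, fixed by numPartitions
        System.out.println(result.get(1).get(1)); // key 1 reduced: 10 + 5 = 15
    }
}
```

Just as in Spark, the number of output buckets is fixed by numPartitions regardless of how many distinct keys exist, and which bucket a key lands in depends only on its hash. That is all the second argument does; the reduce function itself never sees it.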