I have a Flink DataSet (read from a file) that contains sensor readings from many different sensors. I use Flink's groupBy() method to organize the data into one UnsortedGrouping per sensor. Next, I would like to run the KMeans algorithm on each UnsortedGrouping in my DataSet in a distributed fashion.
My question is: how do I implement this efficiently with Flink? Below is my current implementation: I wrote my own groupReduce() method that applies the Flink KMeans algorithm to each UnsortedGrouping. This code works, but it seems very slow and uses a lot of memory.
I think this has to do with the amount of data reorganization I have to do: several data transformations have to be performed just to get the code running, because I don't know how to do this more efficiently:
Surely there must be a more performant and efficient way to implement this? Can somebody show me how to do it in a clean, efficient Flink way?
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.api.java.operators.UnsortedGrouping;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Point, Centroid, SelectNearestCenter, CountAppender, CentroidAccumulator and
// CentroidAverager are adapted from Flink's KMeans example.

// *************************************************************************
// VARIABLES
// *************************************************************************
static int numberClusters = 10;
static int maxIterations = 10;
static int sensorCount = 117;
static ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
    final long startTime = System.currentTimeMillis();

    String fileName = "C:/tmp/data.nt";
    DataSet<String> text = env.readTextFile(fileName);

    // filter the relevant lines from the text file input, parse them into
    // (sensorId, point) tuples and group them by sensor id
    UnsortedGrouping<Tuple2<Integer, Point>> points = text
        .filter(x -> x.contains("Value") && x.contains("valueLiteral"))
        .filter(x -> !x.contains("#string"))
        .map(x -> new Tuple2<Integer, Point>(
            Integer.parseInt(x.substring(x.indexOf("_") + 1, x.indexOf(">"))) % sensorCount,
            new Point(Double.parseDouble(x.split("\"")[1]))))
        .filter(x -> x.f0 < 10)
        .groupBy(0);

    // run KMeans once per sensor group
    DataSet<Tuple2<Integer, Point>> output = points.reduceGroup(new DistinctReduce());
    output.print();

    // print the execution time
    final long endTime = System.currentTimeMillis();
    System.out.println("Total execution time: " + (endTime - startTime) + "ms");
}
public static class DistinctReduce implements GroupReduceFunction<Tuple2<Integer, Point>, Tuple2<Integer, Point>> {
    private static final long serialVersionUID = 1L;

    @Override
    public void reduce(Iterable<Tuple2<Integer, Point>> in, Collector<Tuple2<Integer, Point>> out) throws Exception {
        AtomicInteger counter = new AtomicInteger(0);

        // materialize all points of this group in memory
        List<Point> pointsList = new LinkedList<Point>();
        for (Tuple2<Integer, Point> t : in) {
            pointsList.add(new Point(t.f1.x));
        }

        // build a brand-new DataSet from the collected points; each
        // print()/collect() call below launches a separate Flink job per group
        DataSet<Point> points = env.fromCollection(pointsList);

        // seed the centroids with the first numberClusters distinct points
        DataSet<Centroid> centroids = points
            .distinct()
            .first(numberClusters)
            .map(x -> new Centroid(counter.incrementAndGet(), x));

        IterativeDataSet<Centroid> loop = centroids.iterate(maxIterations);

        DataSet<Centroid> newCentroids = points
            // compute closest centroid for each point
            .map(new SelectNearestCenter()).withBroadcastSet(loop, "centroids")
            // count and sum point coordinates for each centroid
            .map(new CountAppender())
            .groupBy(0)
            .reduce(new CentroidAccumulator())
            // compute new centroids from point counts and coordinate sums
            .map(new CentroidAverager());

        // feed new centroids back into next iteration
        DataSet<Centroid> finalCentroids = loop.closeWith(newCentroids);

        // assign points to final clusters
        DataSet<Tuple2<Integer, Point>> clusteredPoints = points
            .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids, "centroids");

        // emit result
        System.out.println("Results from the KMeans algorithm:");
        clusteredPoints.print();

        // forward the clustered points of this group
        List<Tuple2<Integer, Point>> clusteredPointsList = clusteredPoints.collect();
        for (Tuple2<Integer, Point> t : clusteredPointsList) {
            out.collect(t);
        }
    }
}
Answer 0 (score: 0)
You first have to group the data points and the centroids. Then you iterate over the centroids and co-group them with the data points. For each point in a group you assign it to the closest centroid. Then you group on the initial group index and the centroid index to reduce all points which have been assigned to the same centroid. That will be the result of a single iteration.
The code could look the following way:
DataSet<Tuple2<Integer, Point>> groupedPoints = ...

DataSet<Tuple2<Integer, Centroid>> groupCentroids = ...

IterativeDataSet<Tuple2<Integer, Centroid>> groupLoop = groupCentroids.iterate(10);

DataSet<Tuple2<Integer, Centroid>> newGroupCentroids = groupLoop
    .coGroup(groupedPoints).where(0).equalTo(0)
    .with(new CoGroupFunction<Tuple2<Integer, Centroid>, Tuple2<Integer, Point>, Tuple4<Integer, Integer, Point, Integer>>() {
        @Override
        public void coGroup(Iterable<Tuple2<Integer, Centroid>> centroidsIterable, Iterable<Tuple2<Integer, Point>> points, Collector<Tuple4<Integer, Integer, Point, Integer>> out) throws Exception {
            // cache the centroids of this group
            List<Tuple2<Integer, Centroid>> centroids = new ArrayList<>();
            for (Tuple2<Integer, Centroid> centroidTuple : centroidsIterable) {
                centroids.add(centroidTuple);
            }

            // assign every point to its closest centroid and emit
            // (centroidId, groupId, point, count = 1)
            for (Tuple2<Integer, Point> pointTuple : points) {
                double minDistance = Double.MAX_VALUE;
                int minIndex = -1;
                Point point = pointTuple.f1;

                for (Tuple2<Integer, Centroid> centroidTuple : centroids) {
                    Centroid centroid = centroidTuple.f1;
                    double distance = point.euclideanDistance(centroid);

                    if (distance < minDistance) {
                        minDistance = distance;
                        minIndex = centroid.id;
                    }
                }

                out.collect(Tuple4.of(minIndex, pointTuple.f0, point, 1));
            }
        }
    })
    // sum the points and counts per (centroidId, groupId)
    .groupBy(0, 1).reduce(new ReduceFunction<Tuple4<Integer, Integer, Point, Integer>>() {
        @Override
        public Tuple4<Integer, Integer, Point, Integer> reduce(Tuple4<Integer, Integer, Point, Integer> value1, Tuple4<Integer, Integer, Point, Integer> value2) throws Exception {
            return Tuple4.of(value1.f0, value1.f1, value1.f2.add(value2.f2), value1.f3 + value2.f3);
        }
    })
    // divide the coordinate sums by the counts to obtain the new centroids
    .map(new MapFunction<Tuple4<Integer, Integer, Point, Integer>, Tuple2<Integer, Centroid>>() {
        @Override
        public Tuple2<Integer, Centroid> map(Tuple4<Integer, Integer, Point, Integer> value) throws Exception {
            return Tuple2.of(value.f1, new Centroid(value.f0, value.f2.div(value.f3)));
        }
    });

DataSet<Tuple2<Integer, Centroid>> result = groupLoop.closeWith(newGroupCentroids);
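For completeness, here is one way the elided groupCentroids input and the final cluster assignment could look. This is a minimal sketch, not part of the answer above: it assumes the question's one-dimensional Point and a Centroid class with an id field and a euclideanDistance method as in Flink's KMeans example, and it seeds each group's centroids from that group's first numberClusters points (the question additionally deduplicates them first).

// Hypothetical sketch -- names and the seeding strategy are assumptions.

// Seed one set of centroids per group from its first numberClusters points.
DataSet<Tuple2<Integer, Centroid>> groupCentroids = groupedPoints
    .groupBy(0)
    .reduceGroup(new GroupReduceFunction<Tuple2<Integer, Point>, Tuple2<Integer, Centroid>>() {
        @Override
        public void reduce(Iterable<Tuple2<Integer, Point>> in, Collector<Tuple2<Integer, Centroid>> out) {
            int id = 0;
            for (Tuple2<Integer, Point> t : in) {
                if (id >= numberClusters) {
                    break;
                }
                out.collect(Tuple2.of(t.f0, new Centroid(id++, t.f1)));
            }
        }
    });

// After the iteration, assign every point to the nearest final centroid of its
// own group, emitting (groupId, clusterId, point).
DataSet<Tuple3<Integer, Integer, Point>> clusteredPoints = groupedPoints
    .coGroup(result).where(0).equalTo(0)
    .with(new CoGroupFunction<Tuple2<Integer, Point>, Tuple2<Integer, Centroid>, Tuple3<Integer, Integer, Point>>() {
        @Override
        public void coGroup(Iterable<Tuple2<Integer, Point>> points, Iterable<Tuple2<Integer, Centroid>> centroidsIterable, Collector<Tuple3<Integer, Integer, Point>> out) {
            // cache the final centroids of this group
            List<Centroid> centroids = new ArrayList<>();
            for (Tuple2<Integer, Centroid> c : centroidsIterable) {
                centroids.add(c.f1);
            }
            for (Tuple2<Integer, Point> p : points) {
                int minIndex = -1;
                double minDistance = Double.MAX_VALUE;
                for (Centroid centroid : centroids) {
                    double distance = p.f1.euclideanDistance(centroid);
                    if (distance < minDistance) {
                        minDistance = distance;
                        minIndex = centroid.id;
                    }
                }
                out.collect(Tuple3.of(p.f0, minIndex, p.f1));
            }
        }
    });

Because the seeding happens inside a reduceGroup on the grouped DataSet, everything runs in a single Flink job, which is exactly what the nested-DataSet approach in the question could not achieve.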