Question

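Recently I have been trying to use Apache Flink for fast batch processing. I have a table with one column, value, and an unrelated index column.

Basically, I want to compute the mean and range of the values for every 5 rows. Then I will compute the mean and standard deviation of the means I just calculated. So I figured the best approach is to use a Tumble window.

It looks like this: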

DataSet<Tuple2<Double, Integer>> rawData = {get the source data};
Table table = tableEnvironment.fromDataSet(rawData);
Table groupedTable = table
            .window(Tumble.over("5.rows").on({what should I write?}).as("w"))
            .groupBy("w")
            .select("f0.avg, f0.max-f0.min");

{The next step is to use groupedTable to calculate overall mean and stdDev}

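But I don't know what to write in .on(). I have tried "proctime", but it says there is no such input. I just want the rows grouped in the order they are read from the source. However, the argument has to be a time attribute, so I cannot use "f2", the index column, either.

Do I need to add a timestamp? Is that necessary in batch processing, and will it slow down the computation? What is the best way to solve this?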
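For reference, the computation being asked for can be written down in plain Java, independent of Flink. The following is only a sketch of the intended result (mean and range per 5 rows, then mean and standard deviation of the group means), not a Table API solution. The class and method names are hypothetical, and it assumes the values are already in source order, that a trailing partial group is dropped, and that the population standard deviation is wanted.

```java
// Plain-Java illustration (no Flink): tumbling count windows over values in source order.
// Names are hypothetical; dropping a trailing partial group is an assumption.
class TumbleSketch {

    /** Returns {mean, range} for each full group of `size` consecutive values. */
    static double[][] meanAndRange(double[] values, int size) {
        int groups = values.length / size; // a trailing partial group is ignored
        double[][] out = new double[groups][2];
        for (int g = 0; g < groups; g++) {
            double sum = 0;
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (int i = g * size; i < (g + 1) * size; i++) {
                sum += values[i];
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            out[g][0] = sum / size; // group mean
            out[g][1] = max - min;  // group range
        }
        return out;
    }

    /** Returns {mean, population standard deviation} of the group means. */
    static double[] meanAndStdDev(double[][] groups) {
        int n = groups.length; // n == 0 yields NaN for both results
        double sum = 0;
        for (double[] g : groups) {
            sum += g[0];
        }
        double mean = sum / n;
        double squares = 0;
        for (double[] g : groups) {
            squares += (g[0] - mean) * (g[0] - mean);
        }
        return new double[] {mean, Math.sqrt(squares / n)};
    }
}
```

Calling TumbleSketch.meanAndRange(values, 5) and passing the result to TumbleSketch.meanAndStdDev mirrors the two steps described in the question.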

Update: I tried to use a sliding window in the Table API, but it throws an exception.

// Calculate mean value in each group
    Table groupedTable = table
            .groupBy("f0")
            .select("f0.cast(LONG) as groupNum, f1.avg as avg")
            .orderBy("groupNum");

//Calculate moving range of group Mean using sliding window
    Table movingRangeTable = groupedTable
            .window(Slide.over("2.rows").every("1.rows").on("groupNum").as("w"))
            .groupBy("w")
            .select("groupNum.max as groupNumB, (avg.max - avg.min) as MR");

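The exception is:

Exception in thread "main" java.lang.UnsupportedOperationException: Count sliding group windows on event-time are currently not supported.
    at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.createEventTimeSlidingWindowDataSet(DataSetWindowAggregate.scala:456)
    at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.translateToPlan(DataSetWindowAggregate.scala:139)
    ...

Does this mean that sliding windows are not supported in the Table API? If I remember correctly, there are no window functions in the DataSet API. So how can I compute a moving range in batch processing?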
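To make the target of the sliding window concrete: a window of 2 rows that advances by 1 row over the group means yields, for each consecutive pair, max minus min, which is the absolute difference of the two means. The sketch below shows that computation in plain Java. It is not a Flink API call; the class and method names are hypothetical, and it assumes the means are already sorted by group number and that no partial (one-row) window is emitted.

```java
// Plain-Java illustration (no Flink) of a moving range over consecutive group means.
// Assumes `means` is already ordered by group number; names are hypothetical.
class MovingRangeSketch {

    /** Returns |means[i] - means[i-1]| for every consecutive pair. */
    static double[] movingRange(double[] means) {
        if (means.length < 2) {
            return new double[0]; // no complete window of 2 rows
        }
        double[] ranges = new double[means.length - 1];
        for (int i = 1; i < means.length; i++) {
            // For a window of exactly two values, max - min equals the absolute difference.
            ranges[i - 1] = Math.abs(means[i] - means[i - 1]);
        }
        return ranges;
    }
}
```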

Answer 1

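The window clause is used to define a grouping based on a window function such as Tumble or Session. Grouping every 5 rows is not well defined in the Table API (or SQL) unless you specify the order of the rows. That order is specified in the on clause of the Tumble function. Because this feature originates from stream processing, the on clause expects a timestamp attribute.

You can use the currentTimestamp() function to get a timestamp of the current time. However, I should point out that Flink will sort the data, because it does not know about the monotonic property of the function. Moreover, all of this will run with a parallelism of 1, because there is no clause that would allow partitioning.

Alternatively, you could implement a user-defined scalar function that converts the index attribute into a timestamp (which is effectively a Long value). But again, Flink will sort the full data set.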
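As a rough illustration of the index-to-timestamp idea, the plain-Java sketch below maps each row index to a millisecond value and shows which 5 ms tumbling window it would land in. This is only the arithmetic, not Flink code: the class and method names are hypothetical, and it assumes a contiguous, non-negative index starting at 0, one millisecond per row, and windows aligned to 0. In a Flink job this mapping would live in the user-defined scalar function the answer mentions; whether the Table API accepts the result as a time attribute in a given version is not shown here.

```java
// Plain-Java illustration (no Flink) of mapping a row index to a timestamp-like Long.
// Assumptions: contiguous non-negative index starting at 0, one millisecond per row,
// tumbling windows aligned to 0. Names are hypothetical.
class IndexToTimestampSketch {

    /** Treats the row index as milliseconds, so 5 consecutive rows span 5 ms. */
    static long toTimestampMillis(long index) {
        return index;
    }

    /** Start of the tumbling window of `sizeMillis` that contains `timestamp`. */
    static long windowStart(long timestamp, long sizeMillis) {
        return timestamp - Math.floorMod(timestamp, sizeMillis);
    }
}
```

With a window size of 5 ms, indices 0 to 4 share window start 0 and indices 5 to 9 share window start 5, which corresponds to grouping every 5 rows.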

Apache Flink: How to group every n rows with the Table API?
