I have batches of 5 tuples each, where every tuple is an impression from a user:
Batch 1:
[UUID1, clientId1]
[UUID2, clientId1]
[UUID2, clientId1]
[UUID2, clientId1]
[UUID3, clientId2]
Batch 2:
[UUID4, clientId1]
[UUID5, clientId1]
[UUID5, clientId1]
[UUID6, clientId2]
[UUID6, clientId2]
Here is my example of how I persist the count state:
TridentState ClientState = impressionStream
    .groupBy(new Fields("clientId"))
    .persistentAggregate(getCassandraStateFactory("users", "DataComputation",
        "UserImpressionCounter"), new Count(), new Fields("count"));
Stream ClientStream = ClientState.newValuesStream();
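I start with an empty database and run my topology. After grouping the stream by clientId, I persist the state using the persistentAggregate function with the Count aggregator.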
For the first batch, the result after the newValuesStream method is [clientId1, 4], [clientId2, 1].
For the second batch it is [clientId1, 7], [clientId2, 3], as expected.
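To make the expected numbers concrete, here is a minimal plain-Java sketch (no Storm involved) that mimics a per-key running count across the two batches above. The class name RunningCountSketch and the in-memory map standing in for the Cassandra-backed state are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RunningCountSketch {
    // Stand-in for the persisted state: clientId -> total impressions so far.
    private final Map<String, Long> totals = new LinkedHashMap<>();

    // Applies one batch of clientIds and returns the updated totals for the
    // keys touched by this batch, in first-seen order.
    public Map<String, Long> applyBatch(List<String> clientIds) {
        Map<String, Long> updated = new LinkedHashMap<>();
        for (String clientId : clientIds) {
            long total = totals.merge(clientId, 1L, Long::sum);
            updated.put(clientId, total);
        }
        return updated;
    }

    public static void main(String[] args) {
        RunningCountSketch sketch = new RunningCountSketch();
        // Batch 1: four impressions for clientId1, one for clientId2.
        System.out.println(sketch.applyBatch(Arrays.asList(
            "clientId1", "clientId1", "clientId1", "clientId1", "clientId2")));
        // prints {clientId1=4, clientId2=1}
        // Batch 2: three impressions for clientId1, two for clientId2.
        System.out.println(sketch.applyBatch(Arrays.asList(
            "clientId1", "clientId1", "clientId1", "clientId2", "clientId2")));
        // prints {clientId1=7, clientId2=3}
    }
}
```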
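ClientStream is used in several branches, and in one of these branches I need to process the tuples as if the batch size were 1, because I need the count information for each individual tuple.
Actually using a batch size of 1 is obviously nonsense, so I have to somehow find out the previous state of the counter before it is updated, and emit that value together with the tuple carrying the updated counter, e.g. [clientId1, 7, 4] for the second batch.
Does anyone know how to do this?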
Answer 0 (score: 0)
I solved this by adding a new aggregator and joining its output with the persistent aggregate:
TridentState ClientState = impressionStream
    .groupBy(new Fields("clientId"))
    .persistentAggregate(getCassandraStateFactory("users", "DataComputation",
        "UserImpressionCounter"), new Count(), new Fields("count"));
Stream ClientBatchAggregationStream = impressionStream
    .groupBy(new Fields("clientId"))
    .aggregate(new SumCountAggregator(), new Fields("batchCount"));
Stream GroupingPeriodCounterStateStream = topology
    .join(ClientState.newValuesStream(), new Fields("clientId"),
        ClientBatchAggregationStream, new Fields("clientId"),
        new Fields("clientId", "count", "batchCount"));
SumCountAggregator:
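// Package names below are for Storm 1.x and later; older releases use
// storm.trident.* and backtype.storm.tuple.Values instead.
import org.apache.storm.trident.operation.BaseAggregator;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

// Counts the tuples of each group within a single batch.
public class SumCountAggregator extends BaseAggregator<SumCountAggregator.CountState> {
    static class CountState {
        long count = 0;
    }

    @Override
    public CountState init(Object batchId, TridentCollector collector) {
        // Fresh counter for every group in every batch.
        return new CountState();
    }

    @Override
    public void aggregate(CountState state, TridentTuple tuple, TridentCollector collector) {
        state.count += 1;
    }

    @Override
    public void complete(CountState state, TridentCollector collector) {
        // Emit the per-batch count once the group has been fully processed.
        collector.emit(new Values(state.count));
    }
}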
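The joined stream carries both the updated total (count) and the number of tuples in the current batch (batchCount), so the counter value before the batch should simply be count - batchCount. Below is a minimal plain-Java sketch of that arithmetic, assuming the count from newValuesStream already includes the current batch. PreviousCountSketch and its methods are hypothetical names and not part of Storm.

```java
public class PreviousCountSketch {
    // Counter value before the current batch was applied.
    public static long previousCount(long count, long batchCount) {
        return count - batchCount;
    }

    // Running counter value for each tuple of the batch, as if the tuples
    // had been processed one at a time.
    public static long[] perTupleCounts(long count, long batchCount) {
        long previous = previousCount(count, batchCount);
        long[] counts = new long[(int) batchCount];
        for (int i = 0; i < counts.length; i++) {
            counts[i] = previous + i + 1;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Second batch for clientId1: total is 7, and 3 tuples arrived in the batch.
        long count = 7;
        long batchCount = 3;
        long previous = previousCount(count, batchCount);
        System.out.println("[clientId1, " + count + ", " + previous + "]");
        // prints [clientId1, 7, 4]
        for (long c : perTupleCounts(count, batchCount)) {
            System.out.println(c); // prints 5, 6, 7 on separate lines
        }
    }
}
```

In the topology, the same subtraction could presumably be applied to the joined stream inside a Trident function attached with each(...); that wiring is not shown here.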