Question

I'm currently doing a POC with Cassandra.

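What I want to do: there is a varying number of sensors (not known in advance), and each sensor delivers several values per second. I want to compute the average, minimum, maximum, and rate for each second, minute, hour, and so on.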

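How I model my data: there are multiple column families: raw, avg-5-second, avg-60-second, etc. The row ID is the sensor ID, e.g. machinex:memory. The column name is the timestamp and the column value is the measurement.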
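
As a rough illustration of this layout, the sketch below models one wide row (one sensor) as a sorted map from timestamp to value and rolls it up into fixed-size buckets. It uses plain Java collections rather than Cassandra or Hector; the class name `RollupSketch`, the method `rollup`, and the rule of aligning buckets to multiples of the bucket size are assumptions made for the example.

```java
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for one wide row:
// column name = timestamp (ms), column value = measurement.
final class RollupSketch {

    // Groups raw datapoints into buckets of bucketMillis and summarizes each bucket.
    // The result is keyed by bucket start time, like a row in an aggregate column family.
    static NavigableMap<Long, DoubleSummaryStatistics> rollup(
            NavigableMap<Long, Double> rawRow, long bucketMillis) {
        NavigableMap<Long, DoubleSummaryStatistics> buckets = new TreeMap<>();
        for (Map.Entry<Long, Double> column : rawRow.entrySet()) {
            long bucketStart = Math.floorDiv(column.getKey(), bucketMillis) * bucketMillis;
            buckets.computeIfAbsent(bucketStart, k -> new DoubleSummaryStatistics())
                   .accept(column.getValue());
        }
        return buckets;
    }

    public static void main(String[] args) {
        NavigableMap<Long, Double> row = new TreeMap<>();
        row.put(1_000L, 10.0);
        row.put(1_500L, 20.0);
        row.put(2_200L, 40.0);
        // One-second buckets; each summary carries count, min, max, and average.
        rollup(row, 1_000L).forEach((start, stats) ->
                System.out.println(start + " avg=" + stats.getAverage()
                        + " min=" + stats.getMin() + " max=" + stats.getMax()));
    }
}
```

A single `DoubleSummaryStatistics` per bucket yields average, minimum, and maximum in one pass, so one scan of the raw row could feed several aggregate column families.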

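What I have so far: I built a system that generates data for a single sensor (so a single row ID). I have tasks that fetch some data for a given row ID and store the results in the aggregate column families.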

Example:

Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
Keyspace keyspace = createKeyspace(cluster, "Measurements");

String machine1 = "foo:dev:192.168.1.1:5701";
String rowId = machine1 + ":operationCount";

DatapointRepository rawRepo = new DatapointRepository(cluster, keyspace, "Measurements");
DatapointRepository avgSecondRepo = new DatapointRepository(cluster, keyspace, "averageSecond");
DatapointRepository avgFiveSecondRepo = new DatapointRepository(cluster, keyspace, "averageFiveSeconds");
DatapointRepository maxFiveSecondRepo = new DatapointRepository(cluster, keyspace, "maxFiveSeconds");

ScheduledExecutorService scheduler = new ScheduledThreadPoolExecutor(10);
scheduler.scheduleAtFixedRate(
        new RollupRunnable(
                rawRepo,
                avgSecondRepo,
                rowId,
                "average 1 second",
                new AggregateFunctionFactory(AverageFunction.class)),
        0, 1, TimeUnit.SECONDS);
scheduler.scheduleAtFixedRate(
        new RollupRunnable(
                avgSecondRepo,
                avgFiveSecondRepo,
                rowId,
                "average 5 seconds",
                new AggregateFunctionFactory(AverageFunction.class)),
        0, 5, TimeUnit.SECONDS);
scheduler.scheduleAtFixedRate(
        new RollupRunnable(
                avgSecondRepo,
                maxFiveSecondRepo,
                rowId,
                "maximum 5 seconds",
                new AggregateFunctionFactory(MaximumFunction.class)),
        0, 5, TimeUnit.SECONDS);


long startTime = System.currentTimeMillis();

new GenerateMeasurementsThread(rawRepo, machine1).start();

Thread.sleep(30000);

long endTime = System.currentTimeMillis();

System.out.println("average seconds:");
print(avgSecondRepo, startTime, endTime, machine1 + ":operationCount");
System.out.println("average 5 seconds:");
print(avgFiveSecondRepo, startTime, endTime, machine1 + ":operationCount");
System.out.println("max 5 seconds:");
print(maxFiveSecondRepo, startTime, endTime, machine1 + ":operationCount");


System.out.println("finished");
System.exit(0);

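So if I have a single sensor (a single row ID), or if I know in advance which sensors exist, everything works fine. The problem is that I have a variable number of sensors: new sensors can appear at any moment, and old sensors may stop sending data.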

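My big question is: how can I determine which sensors are available in a given time period? Once I know that, I can create an aggregation task for each sensor.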

Answer 1

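"My big question is: how can I determine which sensors are available in a given time period? Once I know that, I can create an aggregation task for each sensor."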

What you have done so far is index the data by sensor THEN timestamp (sensorId = rowId, timestamp = column name).

What you need to do now is index by time first. I'm afraid you will need to create an additional column family:

rowId = xxx // whatever value, does not really matter

column name = timestamp

column value = sensor ID
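
To make the shape of this index concrete, here is a minimal in-memory sketch in plain Java, not Hector code. A `TreeMap` stands in for the single index row, and a range view over it plays the role of a column slice. The class and method names (`TimeIndexSketch`, `record`, `sensorsBetween`) are invented for the illustration.

```java
import java.util.LinkedHashSet;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of the extra column family:
// one row, column name = timestamp (ms), column value = sensor ID.
final class TimeIndexSketch {
    private final NavigableMap<Long, String> indexRow = new TreeMap<>();

    // Would be called alongside every write to the raw column family.
    void record(long timestampMillis, String sensorId) {
        indexRow.put(timestampMillis, sensorId);
    }

    // Distinct sensors seen in [fromMillis, toMillis], similar to a column slice.
    Set<String> sensorsBetween(long fromMillis, long toMillis) {
        return new LinkedHashSet<>(
                indexRow.subMap(fromMillis, true, toMillis, true).values());
    }

    public static void main(String[] args) {
        TimeIndexSketch index = new TimeIndexSketch();
        index.record(1_000L, "machineA:memory");
        index.record(2_000L, "machineB:cpu");
        index.record(9_000L, "machineC:disk");
        System.out.println(index.sensorsBetween(0L, 5_000L));
    }
}
```

Because the map is keyed by timestamp alone, two sensors that write at the same millisecond overwrite each other in this sketch.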

Answer 2

@userxxxx

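"I have implemented your suggestion and it works, except for one bug: if multiple sensors have datapoints at the same 'time', only the name of the last saved datapoint is shown."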

Easy fix:

rowId = xxx // whatever value, does not really matter

column name = composite of(timestamp,sensorId)

column value = nothing

By making the column name a composite of timestamp and sensorId, you cover the case where multiple sensors report at the same time.

Since the sensorId is stored directly in the column name, you no longer need a column value. This is called a valueless column family.
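
The effect of the composite column name can be sketched in plain Java as a sorted set of (timestamp, sensorId) pairs with no value attached. This is an in-memory analogy under assumed names (`CompositeIndexSketch`, `IndexColumn`), not Cassandra or Hector code.

```java
import java.util.LinkedHashSet;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

final class CompositeIndexSketch {

    // Hypothetical composite column name: ordered by timestamp, then by sensor ID.
    static final class IndexColumn implements Comparable<IndexColumn> {
        final long timestampMillis;
        final String sensorId;

        IndexColumn(long timestampMillis, String sensorId) {
            this.timestampMillis = timestampMillis;
            this.sensorId = sensorId;
        }

        @Override
        public int compareTo(IndexColumn other) {
            int byTime = Long.compare(timestampMillis, other.timestampMillis);
            return byTime != 0 ? byTime : sensorId.compareTo(other.sensorId);
        }
    }

    // TreeSet uses compareTo for ordering and uniqueness; there is no value part.
    private final NavigableSet<IndexColumn> columns = new TreeSet<>();

    void record(long timestampMillis, String sensorId) {
        columns.add(new IndexColumn(timestampMillis, sensorId));
    }

    // Distinct sensors with at least one column in [fromMillis, toMillis].
    Set<String> sensorsBetween(long fromMillis, long toMillis) {
        Set<String> sensors = new LinkedHashSet<>();
        // "" sorts before any other string, so the scan starts at the first column at fromMillis.
        for (IndexColumn column : columns.tailSet(new IndexColumn(fromMillis, ""), true)) {
            if (column.timestampMillis > toMillis) {
                break;
            }
            sensors.add(column.sensorId);
        }
        return sensors;
    }

    public static void main(String[] args) {
        CompositeIndexSketch index = new CompositeIndexSketch();
        index.record(1_000L, "machineA:memory");
        index.record(1_000L, "machineB:memory"); // same timestamp, different sensor
        System.out.println(index.sensorsBetween(1_000L, 1_000L));
    }
}
```

Two sensors reporting at the same timestamp now produce two distinct entries, because the sensor ID is part of the key rather than a value that gets overwritten.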

CQL script to create such a table:

CREATE TABLE sensor_index_by_date
(
   row_id text, // whatever
   date timestamp,
   sensor_id bigint,
   PRIMARY KEY(row_id,date,sensor_id)
);
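
Once such an index can report which sensors were recently active, the per-sensor aggregation tasks from the question could be started and stopped on demand. The sketch below is one possible way to do that with a `ScheduledExecutorService`. `SensorTaskRegistry`, `sync`, and the `taskFactory` parameter are hypothetical names; the factory stands in for whatever builds a task such as the question's `RollupRunnable` for a given row ID, and `sync` is assumed to be called from a single thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical helper: keeps one scheduled task per currently active sensor.
final class SensorTaskRegistry {
    private final ScheduledExecutorService scheduler;
    private final Function<String, Runnable> taskFactory; // assumption: builds the task for a row ID
    private final Map<String, ScheduledFuture<?>> running = new ConcurrentHashMap<>();

    SensorTaskRegistry(ScheduledExecutorService scheduler,
                       Function<String, Runnable> taskFactory) {
        this.scheduler = scheduler;
        this.taskFactory = taskFactory;
    }

    // Starts tasks for newly seen sensors and cancels tasks for sensors
    // that are absent from the active set.
    void sync(Set<String> activeSensors, long periodSeconds) {
        for (String sensorId : activeSensors) {
            running.computeIfAbsent(sensorId, id -> scheduler.scheduleAtFixedRate(
                    taskFactory.apply(id), 0, periodSeconds, TimeUnit.SECONDS));
        }
        running.entrySet().removeIf(entry -> {
            if (activeSensors.contains(entry.getKey())) {
                return false;
            }
            entry.getValue().cancel(false);
            return true;
        });
    }

    Set<String> scheduledSensors() {
        return new HashSet<>(running.keySet());
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        SensorTaskRegistry registry = new SensorTaskRegistry(
                scheduler, id -> () -> System.out.println("rollup for " + id));
        // In a real setup the active set would come from a query on the index table.
        registry.sync(new HashSet<>(Arrays.asList("machineA:memory", "machineB:cpu")), 1);
        System.out.println(registry.scheduledSensors());
        scheduler.shutdownNow();
    }
}
```

Calling `sync` periodically with the result of a time-range query on the index would pick up new sensors and retire silent ones without knowing the sensor list in advance.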

Time series data with an arbitrary number of rows
