Question

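We are planning to store time-series sensor data in Cassandra. Each sensor can report multiple data points per sample time point. I would like to store all the data points for each device together.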

One idea I had was to create all the possible columns for the various data types we collect:

CREATE TABLE ddata (
  deviceID int,
  day timestamp,
  timepoint timestamp, 
  aparentPower int,
  actualPower int,
  actualEnergy int,
  temperature float,
  humidity float,
  ppmCO2 int,
  -- etc, etc, etc...
  PRIMARY KEY ((deviceID,day),timepoint)
) WITH
  clustering order by (timepoint DESC);

insert into ddata (deviceID,day,timepoint,temperature,humidity) values (1000001,'2013-09-02','2013-09-02 00:00:04',93,97.3);

 deviceid | day                      | timepoint                | actualenergy | actualpower | aparentpower | event | humidity | ppmco2 | temperature
----------+--------------------------+--------------------------+--------------+-------------+--------------+-------+----------+--------+-------------
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:04-0700 |         null |        null |         null |  null |     97.3 |   null |          93
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:03-0700 |         null |        null |         null |  null |     null |   null |          92
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:02-0700 |         null |        null |         null |  null |     null |   null |          91
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:01-0700 |         null |        null |         null |  null |     null |   null |          90

Another idea would be to create a map collection of the various data points a given device may report:

CREATE TABLE ddata (
  deviceID int,
  day timestamp,
  timepoint timestamp, 
  feeds map<text,int>,
  PRIMARY KEY ((deviceID,day),timepoint)
) WITH
  clustering order by (timepoint DESC);

insert into ddata (deviceID,day,timepoint,feeds) values (1000001,'2013-09-01','2013-09-01 00:00:04',{'temp':73,'humidity':99});

 deviceid | day                      | timepoint                | event      | feeds
----------+--------------------------+--------------------------+------------+----------------------------------------------------------
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:04-0700 |       null |                             {'humidity': 97, 'temp': 93}
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:03-0700 |       null |                                             {'temp': 92}
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:02-0700 |       null |                                             {'temp': 91}
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:01-0700 |       null |                                             {'temp': 90}

What are people's thoughts on the two alternatives?

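From what I can see, the first option would allow better typing of the different data types (int vs. float), but it makes the table rather ugly.
Would performance be better if I avoid using a collection type?
Is there anything to worry about in continually adding columns as new sensor data types are introduced?
What other factors should I consider?
What other data modeling ideas do people have for this scenario?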

Thanks, Chris

Answer 1

Basically, since we don't know in advance how many measurements there will be, we need a dynamic way to describe the column family.

As you pointed out in your second example, CQL provides the map data type for holding dynamic collections.

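The second option is preferable, but it also depends on the queries you are likely to issue. To get 'temp' from 'feeds', the application has to parse the map output.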
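As a rough sketch of what that looks like, assuming the map-based ddata table and the sample device and day from the question, a read would fetch the whole feeds map for each time point and leave it to the client to pick out the entry it needs:

```cql
-- Assumes the map-based ddata table from the question.
-- Each returned row carries the complete feeds map for that time point.
SELECT timepoint, feeds
FROM ddata
WHERE deviceID = 1000001
  AND day = '2013-09-02';

-- The application then looks up the 'temp' key in each returned map itself.
```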

Answer 2

I can see some immediate pros and cons:

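- Using a map column lets you have an "unlimited" number of metrics. (N.B. I think there is a limit on how much data you can store in a map.)
- You won't be able to read a single value out of the map; if each metric has its own column, you can read one value at a time. You can still update the map, though.
- As you mentioned in your question, a map ...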

Those are the most obvious differences I can see.
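To illustrate the read and update points, here is a minimal sketch. It assumes the two ddata definitions from the question (they share a name, so each statement applies only to the schema named in its comment) and reuses the question's sample device and day. The value 74 is made up for illustration.

```cql
-- Column-per-metric schema: read just one metric for a device and day.
SELECT timepoint, temperature
FROM ddata
WHERE deviceID = 1000001
  AND day = '2013-09-02';

-- Map schema: a single entry can still be written without rewriting the whole map.
UPDATE ddata
SET feeds['temp'] = 74
WHERE deviceID = 1000001
  AND day = '2013-09-02'
  AND timepoint = '2013-09-02 00:00:04';
```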

Cassandra data model options: lots of columns for all potential reading types, or a map collection?
