Question

I have a table in my database that records readings from several sensors, like this:

CREATE TABLE [test].[readings] (
    [timestamp_utc] DATETIME2(0) NOT NULL, -- 48bits
    [sensor_id] INT NOT NULL, -- 32 bits
    [site_id] INT NOT NULL, -- 32 bits
    [reading] REAL NOT NULL, -- 32 bits
    PRIMARY KEY([timestamp_utc], [sensor_id], [site_id])
)

CREATE TABLE [test].[sensors] (
    [sensor_id] int NOT NULL ,
    [measurement_type_id] int NOT NULL,
    [site_id] int NOT NULL ,
    [description] varchar(255) NULL ,
    PRIMARY KEY ([sensor_id], [site_id])
)

I would like to be able to easily derive statistics from all these readings.

Some of the queries I would like to run:

Get me all readings for site_id = X between date_hour1 and date_hour2

Get me all readings for site_id = X and sensor_id in <list> between date_hour1 and date_hour2

Get me all readings for site_id = X and sensor measurement type = Z between date_hour1 and date_hour2

Get me all readings for site_id = X, aggregated (average) by DAY between date_hour1 and date_hour2

Get me all readings for site_id = X, aggregated (average) by DAY between date_hour1 and date_hour2 but in UTC+3 (this should give a different result from the previous query, because the start and end of each day are now shifted by 3 hours)

Get me min, max, std, mean for all readings for site_id = X between date_hour1 and date_hour2
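
To make the last three requests concrete, here is a rough T-SQL sketch of the daily aggregation with a fixed UTC offset, written against the test.readings table above. The variable names (@siteId, @from, @to, @offsetMinutes) and their sample values are hypothetical, and a fixed minute offset is an assumption that ignores daylight saving time.

```sql
-- Sketch only: variable names and sample values are hypothetical.
DECLARE @siteId        INT          = 1,
        @from          DATETIME2(0) = '2017-01-01 00:00:00', -- UTC, inclusive
        @to            DATETIME2(0) = '2017-02-01 00:00:00', -- UTC, exclusive
        @offsetMinutes INT          = 180;                   -- UTC+3

-- Shift each UTC timestamp by the offset, then truncate to a date,
-- so that day boundaries fall at local midnight instead of UTC midnight.
SELECT CAST(DATEADD(MINUTE, @offsetMinutes, r.timestamp_utc) AS DATE) AS local_day,
       AVG(r.reading)   AS mean_reading,
       MIN(r.reading)   AS min_reading,
       MAX(r.reading)   AS max_reading,
       STDEV(r.reading) AS std_reading
FROM test.readings AS r
WHERE r.site_id = @siteId
  AND r.timestamp_utc >= @from
  AND r.timestamp_utc <  @to
GROUP BY CAST(DATEADD(MINUTE, @offsetMinutes, r.timestamp_utc) AS DATE)
ORDER BY local_day;
```

Setting @offsetMinutes to 0 gives the plain UTC daily average, and dropping the GROUP BY and local_day column gives min, max, std and mean over the whole range.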

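So far I have been using Java to query the database and doing all of this processing locally. But this ends up being rather slow, and the code is messy to write and maintain (too many loops, generic functions to perform repetitive tasks, a large and verbose codebase, etc.)...

To make things worse, the readings table is huge (hence the importance of the primary key, which also serves as a performance index), and maybe I should be using a time-series database (are there any good ones?). I am using SQL Server.

What is the best way to do this? I feel like I am reinventing the wheel, because all of this is essentially an analytics application...

I know these queries sound simple, but when you try to parameterize all of this you end up with a monster like this: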

-- Sums all device readings, returns timestamps in localtime according to utcOffset (if utcOffset = 00:00, then timestamps are in UTC)
CREATE PROCEDURE upranking.getSumOfReadingsForDevices
    @facilityId int,
    @deviceIds varchar(MAX),
    @beginTS datetime2,
    @endTS datetime2,
    @utcOffset varchar(6),
    @resolution varchar(6) -- NO, HOURS, DAYS, MONTHS, YEARS
AS BEGIN
    SET NOCOUNT ON -- http://stackoverflow.com/questions/24428928/jdbc-sql-error-statement-did-not-return-a-result-set
    DECLARE @deviceIdsList TABLE (
            id int NOT NULL
    );

    DECLARE @beginBoundary datetime2,
            @endBoundary datetime2;

    SELECT @beginBoundary = DATEADD(day, -1, @beginTS);
    SELECT @endBoundary = DATEADD(day, 1, @endTS);

    -- We shift sign from the offset because we are going to convert the zone for the entire table and not beginTS endTS themselves
    SELECT @utcOffset = CASE WHEN LEFT(@utcOffset, 1) = '+' THEN STUFF(@utcOffset, 1, 1, '-') ELSE STUFF(@utcOffset, 1, 1, '+') END

    INSERT INTO @deviceIdsList
    SELECT convert(int, value) FROM string_split(@deviceIds, ',');

    SELECT SUM(reading) as reading,
           timestamp_local
    FROM (
            SELECT reading,
                   upranking.add_timeoffset_to_datetime2(timestamp_utc, @utcOffset, @resolution) as timestamp_local
            FROM upranking.readings
            WHERE
                device_id IN (SELECT id FROM @deviceIdsList)
                AND facility_id = @facilityId
                AND timestamp_utc BETWEEN @beginBoundary AND @endBoundary
         ) as innertbl
    WHERE timestamp_local BETWEEN @beginTS AND @endTS
    GROUP BY timestamp_local
    ORDER BY timestamp_local
END
GO

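This is a query that receives the site ID (facilityId in this case), the list of sensor IDs (deviceIds in this case), the begin and end timestamps, then the UTC offset as a string like "+xx:xx" or "-xx:xx", and finally the resolution, which basically says how the SUM will aggregate the results (taking the UTC offset into account).

Since I am using Java, at first glance I could use Hibernate or something similar, but I feel that Hibernate is not made for these kinds of queries.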

Answer 1

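Your structure looks fine at first glance, but looking at your queries makes me think there are adjustments you may want to try. Performance is never an easy subject, and a "one size fits all" answer is not easy to find. Here are some considerations:

Do you want better read or write performance? If you want better read performance, you need to reconsider your indexes. Sure, you have a primary key, but most of your queries do not use it (all three fields). Try creating an index for [sensor_id], [site_id].
Can you use a cache? If some searches are recurrent and your application is the single point of entry to the database, evaluate whether your use case would benefit from caching.
If the readings table is huge, consider using some kind of partitioning strategy. See the MSSQL documentation.
If you do not need real-time data, try some kind of search engine, such as Elasticsearch.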
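
As a minimal sketch of the index suggestion in the first point, assuming the test.readings table from the question (the index name is hypothetical):

```sql
-- Sketch only: IX_readings_sensor_site is a hypothetical name.
-- Nonclustered index on the two columns suggested above.
CREATE NONCLUSTERED INDEX IX_readings_sensor_site
    ON test.readings (sensor_id, site_id);
```

Whether this helps depends on the actual workload, so compare execution plans before and after. Since every query listed in the question filters on site_id and a time range, a variant keyed on (site_id, timestamp_utc) may also be worth testing; that variant is an assumption on top of the suggestion above, not something measured here.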

Creating statistics from a SQL table