Question

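I am interested in designing SQL-based (actually SQLite) storage for an application that processes a large number of similar data entries. For this example, let it be a chat message store.

The application has to provide filtering and analysis of the data by message participants, tags, and so on, all of which imply N-to-N relationships.

So the schema (a kind of star schema) looks like: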

create table messages (
    message_id INTEGER PRIMARY KEY,
    time_stamp INTEGER NOT NULL
    -- other fact fields
);

create table users (
    user_id INTEGER PRIMARY KEY,
    user_name TEXT
    -- other user dimension data
);

create table message_participants (
    user_id INTEGER references users(user_id),
    message_id INTEGER references messages(message_id)
);

create table tags (
    tag_id INTEGER PRIMARY KEY,
    tag_name TEXT NOT NULL
    -- tag dimension data
);

create table message_tags (
    tag_id INTEGER references tags(tag_id),
    message_id INTEGER references messages(message_id)
);

-- etc.

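So all is well until I have to perform analytical operations and filtering based on the N-to-N dimensions. Given millions of rows in the messages table and thousands in the dimensions (and there are more dimensions than the example shows), all the joins are simply too much of a performance hit.

For example, I want to analyze the number of messages each user participated in, with the data filtered by selected tags, selected users, and other aspects: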

select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id=MP.message_id
join users as U on MP.user_id=U.user_id
where
    MP.user_id not in ( /* some user ID's set */ )
    and M.time_stamp between @StartTime and @EndTime
    -- and more fact table fields filtering
    and M.message_id in
        (select message_id
        from message_tags
        where tag_id in ( /* some tag ID's set */ ))
    -- and more N-to-N filtering
group by U.user_id

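I am constrained to SQL, and specifically to SQLite. And I do use indexes on the tables.

Is there some way I'm not seeing to improve the schema, maybe a clever way to denormalize it?

Or maybe there is a way to somehow index the dimension keys inside the message row (I thought about using the FTS feature, but I'm not sure whether searching the text index and joining on the results would give any performance gain)?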
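The question does not say which indexes exist. Purely as an illustration, the sketch below assumes composite indexes on the two junction tables (the index names and column order are hypothetical, not taken from the original setup) and uses Python's sqlite3 module to ask SQLite how it would run the tag filter.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table message_tags (
    tag_id INTEGER,
    message_id INTEGER
);
create table message_participants (
    user_id INTEGER,
    message_id INTEGER
);
-- Hypothetical indexes (assumed, not from the question): the leading
-- column matches the lookup key, the second column makes the index covering.
create index idx_message_tags_tag_message
    on message_tags (tag_id, message_id);
create index idx_message_participants_message_user
    on message_participants (message_id, user_id);
""")

# EXPLAIN QUERY PLAN returns rows of (id, parent, notused, detail).
plan = con.execute(
    "explain query plan "
    "select message_id from message_tags where tag_id in (1, 2)"
).fetchall()
plan_details = [row[3] for row in plan]

for detail in plan_details:
    print(detail)
```

If the printed plan reports a scan of message_tags rather than a search through an index, the tag filter is a likely place to start looking.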

Answer 1

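This is too long for a comment and might help with performance, but it is not a direct answer to your question (your schema looks fine): have you tried messing with the query itself?

I often see that kind of subselect filter used for many-to-many relationships, and I have found that in large queries like this one, running a CTE/join instead of where blah in (subselect) often improves performance: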

;with tagMessages as (
    select distinct message_id
    from message_tags
    where tag_id in ( /* some tag ID's set */ )
) -- more N-to-N filtering
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id=MP.message_id
join users as U on MP.user_id=U.user_id
join tagMessages on M.message_id = tagMessages.message_id
where
    MP.user_id not in ( /* some user ID's set */ )
    and M.time_stamp between @StartTime and @EndTime
    -- and more fact table fields filtering
group by U.user_id

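We could argue that the two forms are the same, but the query planner sometimes finds this one more useful.

Disclaimer: I don't do SQLite, I do SQL Server, so apologies if I have made some obvious (or not so obvious) mistakes.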
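One way to check both claims on a given database is to run the two forms side by side and look at the plans. The following is a minimal sketch using Python's sqlite3 module against a tiny in-memory copy of the schema; the sample rows, the tag set (1, 2), the :start/:end parameter names, and the user_name column are assumptions made for illustration only.

```python
import sqlite3

SCHEMA = """
create table messages (message_id INTEGER PRIMARY KEY, time_stamp INTEGER NOT NULL);
create table users (user_id INTEGER PRIMARY KEY, user_name TEXT);
create table message_participants (
    user_id INTEGER references users(user_id),
    message_id INTEGER references messages(message_id));
create table tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT NOT NULL);
create table message_tags (
    tag_id INTEGER references tags(tag_id),
    message_id INTEGER references messages(message_id));
"""

SUBSELECT = """
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id = MP.message_id
join users as U on MP.user_id = U.user_id
where M.time_stamp between :start and :end
  and M.message_id in
      (select message_id from message_tags where tag_id in (1, 2))
group by U.user_id
order by U.user_id
"""

CTE = """
with tagMessages as (
    select distinct message_id from message_tags where tag_id in (1, 2)
)
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id = MP.message_id
join users as U on MP.user_id = U.user_id
join tagMessages as T on M.message_id = T.message_id
where M.time_stamp between :start and :end
group by U.user_id
order by U.user_id
"""

def build_sample_db():
    """Create the schema in memory and load a few made-up rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("insert into messages values (?, ?)",
                    [(1, 10), (2, 20), (3, 30), (4, 99)])
    con.executemany("insert into users values (?, ?)", [(1, "ann"), (2, "bob")])
    # (user_id, message_id)
    con.executemany("insert into message_participants values (?, ?)",
                    [(1, 1), (2, 1), (1, 2), (2, 3), (1, 4)])
    con.executemany("insert into tags values (?, ?)",
                    [(1, "a"), (2, "b"), (3, "c")])
    # (tag_id, message_id); message 1 carries both selected tags
    con.executemany("insert into message_tags values (?, ?)",
                    [(1, 1), (2, 1), (1, 2), (3, 3), (1, 4)])
    return con

def compare(con, params):
    """Run both query forms and collect their EXPLAIN QUERY PLAN rows."""
    rows_sub = con.execute(SUBSELECT, params).fetchall()
    rows_cte = con.execute(CTE, params).fetchall()
    plans = {name: con.execute("explain query plan " + sql, params).fetchall()
             for name, sql in (("subselect", SUBSELECT), ("cte", CTE))}
    return rows_sub, rows_cte, plans

con = build_sample_db()
rows_sub, rows_cte, plans = compare(con, {"start": 0, "end": 50})
print(rows_sub == rows_cte)
for name, plan in plans.items():
    print(name, [row[3] for row in plan])
```

Note that the distinct in the CTE matters: message 1 carries two of the selected tags, so joining against message_tags without it would count that message twice. Which plan is faster depends on the SQLite version and the real data, so timing both on a realistic copy of the database is the only reliable comparison.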

"Data warehouse" style SQLite store design