Question

This is my input data:

GroupId Serial Action
1        1      Start
1        2      Run
1        3      Jump
1        8      End
2        9      Shop
2        10     Start
2        11     Run

For each activity sequence within a group, I want to find pairs of actions where Action1.SerialNo = Action2.SerialNo + k, along with how many times each pair occurs.

Suppose k = 1; then the output will be:

FirstAction  NextAction Frequency
Start Run 2
Run Jump  1
Shop Start 1

If the input table contains millions of entries, how can I do this quickly in SQL?

Answer 1

tful, this should produce the result you want, but I don't know whether it will be as fast as you'd like. It's worth a try.

create table Actions(
  GroupId int,
  Serial int,
  "Action" varchar(20) not null,
  primary key (GroupId, Serial)
);

insert into Actions values
  (1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
  (1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
  (2,11,'Run');
go

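declare @k int = 1;
-- Tag 'a' rows keep their Serial; tag 'b' rows are shifted back by @k,
-- so the actions at Serial s and s+@k meet on the same (GroupId, Serial).
-- GroupId is carried through so that pairs never span two groups.
with ActionsDoubled(GroupId,Serial,Tag,"Action") as (
  select
    GroupId, Serial, 'a', "Action"
  from Actions as A
  union all
  select
    GroupId, Serial-@k, 'b', "Action"
  from Actions
  as B
), Pivoted(GroupId,Serial,a,b) as (
  select GroupId,Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
  select 
    a, b, count(*) as ct
    from Pivoted
    where a is not NULL and b is not NULL
    group by a,b
    order by a,b;
go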

drop table Actions;

If you want to run the same calculation for various values of @k on stable data, this may work better in the long run:

declare @k int = 1;
-- Materialize the doubled rows once; GroupId is kept so pairs stay within a group.
  select 
    GroupId, Serial, 'a' as Tag, "Action"
  into ActionsDoubled
  from Actions as A
  union all
  select
    GroupId, Serial-@k, 'b', "Action"
  from Actions
  as B;
go

create unique clustered index AD_S on ActionsDoubled(GroupId,Serial,Tag);
create index AD_a on ActionsDoubled(Tag,GroupId,Serial);
go

with Pivoted(GroupId,Serial,a,b) as (
  select GroupId,Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
  select 
    a, b, count(*) as ct
    from Pivoted
    where a is not NULL and b is not NULL
    group by a,b
    order by a,b;
go

drop table ActionsDoubled;

Answer 2

-- a2 is the row @k serials after a1, so a1 is the first action of the pair.
SELECT a1.Action AS FirstAction, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
 ON (a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + @k)
GROUP BY a1.Action, a2.Action;

Answer 3

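The problem is that, no matter what, your query has to go through every row.

You can make this more manageable for the database by handling each group separately, as its own query, especially if each group is small.

There's a lot going on under the hood, and when a query has to scan the whole table, it can end up many times slower than working through small chunks that together still cover all of the million rows.

For example: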

--Stickler for clean formatting...
SELECT 
    a1.Action AS FirstAction, 
    a2.Action AS NextAction,
    COUNT(*) AS Frequency
FROM 
     Activities a1 JOIN Activities a2
     ON (a1.groupid = a2.groupid 
         AND a2.Serial = a1.Serial + @k)   -- a2 follows a1 by @k
WHERE
     a1.groupid = 1
GROUP BY
     a1.Action,
     a2.Action;
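
To make the per-group idea a bit more concrete, here is a rough sketch of running that query once per group and summing the partial counts at the end. It assumes the table is called Activities with columns GroupId, Serial and Action, as in the query above; the temp table #PairCounts and the cursor name are made up for illustration. The join is written so that the first action is the one with the lower Serial, which is what the expected output in the question shows. Whether this beats a single set-based query depends on your data and indexes, so measure both.

```sql
-- Sketch only: Activities(GroupId, Serial, Action) is assumed from the query above.
DECLARE @k int = 1;
DECLARE @g int;

-- Hypothetical temp table that collects the per-group partial counts.
CREATE TABLE #PairCounts (
    FirstAction varchar(20) NOT NULL,
    NextAction  varchar(20) NOT NULL,
    Frequency   int         NOT NULL
);

DECLARE group_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT GroupId FROM Activities;

OPEN group_cursor;
FETCH NEXT FROM group_cursor INTO @g;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Same pair query as above, restricted to one group at a time.
    INSERT INTO #PairCounts (FirstAction, NextAction, Frequency)
    SELECT a1.[Action], a2.[Action], COUNT(*)
    FROM Activities a1
    JOIN Activities a2
      ON a1.GroupId = a2.GroupId
     AND a2.Serial  = a1.Serial + @k
    WHERE a1.GroupId = @g
    GROUP BY a1.[Action], a2.[Action];

    FETCH NEXT FROM group_cursor INTO @g;
END;

CLOSE group_cursor;
DEALLOCATE group_cursor;

-- Combine the partial counts into the final frequency table.
SELECT FirstAction, NextAction, SUM(Frequency) AS Frequency
FROM #PairCounts
GROUP BY FirstAction, NextAction
ORDER BY FirstAction, NextAction;

DROP TABLE #PairCounts;
```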

By the way, you do have an index on (GroupId, Serial), right?
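
If you don't, something along these lines would add one. This is only a sketch: the table name Activities is taken from the query above, and the index name is made up. If the table already has a primary key on (GroupId, Serial), as in the first answer's setup script, SQL Server has already built an index on those columns and you can skip this.

```sql
-- Assumption: table Activities(GroupId, Serial, Action); the index name is hypothetical.
-- INCLUDE stores Action in the index leaf rows so the self-join can be answered
-- from the index without extra lookups into the base table.
CREATE INDEX IX_Activities_GroupId_Serial
    ON Activities (GroupId, Serial)
    INCLUDE ([Action]);
```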
