How to do this data transformation

Time: 2009-08-22 03:09:21

Tags: sql sql-server

This is my input data:

GroupId Serial Action
1        1      Start
1        2      Run
1        3      Jump
1        8      End
2        9      Shop
2        10     Start
2        11     Run

For each sequence of activities within a group, I want to find pairs of actions where Action1.SerialNo = Action2.SerialNo + k, and how many times each pair occurs.

Suppose k = 1; then the output will be:

FirstAction  NextAction  Frequency
Start        Run         2
Run          Jump        1
Shop         Start       1

If the input table contains millions of rows, how can I do this quickly in SQL?

3 Answers:

Answer 0: (score: 1)

tful, this should produce the result you want, but I don't know whether it will be as fast as you'd like. It's worth a try.

create table Actions(
  GroupId int,
  Serial int,
  "Action" varchar(20) not null,
  primary key (GroupId, Serial)
);

insert into Actions values
  (1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
  (1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
  (2,11,'Run');
go

declare @k int = 1;
-- GroupId is carried through so that pairs are only formed within a group
-- (otherwise End/8 in group 1 would pair with Shop/9 in group 2).
with ActionsDoubled(GroupId,Serial,Tag,"Action") as (
  select
    GroupId, Serial, 'a', "Action"
  from Actions as A
  union all
  select
    GroupId, Serial-@k, 'b', "Action"
  from Actions
  as B
), Pivoted(GroupId,Serial,a,b) as (
  select GroupId,Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
  select 
    a, b, count(*) as ct
    from Pivoted
    where a is not NULL and b is not NULL
    group by a,b
    order by a,b;
go

drop table Actions;

If you are going to run the same calculation for various values of @k over stable data, this may work better in the long run:

declare @k int = 1;
  -- GroupId is kept so that pairs are only formed within a group.
  select 
    GroupId, Serial, 'a' as Tag, "Action"
  into ActionsDoubled
  from Actions as A
  union all
  select
    GroupId, Serial-@k, 'b', "Action"
  from Actions
  as B;
go

create unique clustered index AD_S on ActionsDoubled(GroupId,Serial,Tag);
create index AD_a on ActionsDoubled(Tag,GroupId,Serial);
go

with Pivoted(GroupId,Serial,a,b) as (
  select GroupId,Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
  select 
    a, b, count(*) as ct
    from Pivoted
    where a is not NULL and b is not NULL
    group by a,b
    order by a,b;
go

drop table ActionsDoubled;

Answer 1: (score: 0)

SELECT a1.Action AS FirstAction, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
 ON (a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + @k)  -- a2 is the action k serials after a1
GROUP BY a1.Action, a2.Action;
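
To try a query of this shape against the sample data from the question, a self-contained script along these lines could be used. It is only a sketch: the column types are assumed, the multi-row VALUES list and the DECLARE initializer need SQL Server 2008 or later, and the join is written so that a1 is the earlier action and a2 the one whose Serial is k higher, which is the order shown in the question's expected output.

```sql
-- Sample table; the name Activities follows the query above,
-- the column types are an assumption.
CREATE TABLE Activities (
    GroupId  int         NOT NULL,
    Serial   int         NOT NULL,
    [Action] varchar(20) NOT NULL,
    PRIMARY KEY (GroupId, Serial)
);

INSERT INTO Activities (GroupId, Serial, [Action]) VALUES
    (1, 1, 'Start'), (1, 2, 'Run'), (1, 3, 'Jump'), (1, 8, 'End'),
    (2, 9, 'Shop'), (2, 10, 'Start'), (2, 11, 'Run');

DECLARE @k int = 1;

-- a1 is the earlier action; a2 has a Serial that is k higher in the same group.
SELECT a1.[Action] AS FirstAction, a2.[Action] AS NextAction, COUNT(*) AS Frequency
FROM Activities a1
JOIN Activities a2
  ON a1.GroupId = a2.GroupId
 AND a2.Serial = a1.Serial + @k
GROUP BY a1.[Action], a2.[Action]
ORDER BY FirstAction, NextAction;

DROP TABLE Activities;
```

With the sample rows and @k = 1 this returns Run/Jump 1, Shop/Start 1 and Start/Run 2. End (Serial 8) and Shop (Serial 9) are adjacent in Serial but belong to different groups, so they are not paired.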

Answer 2: (score: 0)

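The problem is this: no matter what, your query has to go through every row.

You can make the work more manageable for the database by handling each group on its own, as a separate query, especially if each group is small.

A lot goes on behind the scenes, and a query that has to scan the entire table can end up many times slower than working through small chunks that together cover all of the million rows.

For example: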

--Stickler for clean formatting...
SELECT 
    a1.Action AS FirstAction, 
    a2.Action AS NextAction,
    COUNT(*) AS Frequency
FROM 
     Activities a1 JOIN Activities a2
     ON (a1.groupid = a2.groupid 
         AND a2.Serial = a1.Serial + @k)   -- a2 is the action k serials after a1
WHERE
     a1.groupid = 1
GROUP BY
     a1.Action,
     a2.Action;
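
To cover every group this way, the per-group query has to be run once per GroupId and the partial counts added up afterwards. A rough sketch of that loop, assuming the Activities table above: the temp table #PairCounts, the variable @g and the varchar(20) column size are invented for illustration, and the join is written so that a2 is the action k serials after a1. Whether this actually beats the single set-based query is something to measure rather than assume.

```sql
DECLARE @k int = 1;
DECLARE @g int;

-- Hypothetical temp table that collects the per-group counts.
CREATE TABLE #PairCounts (
    FirstAction varchar(20) NOT NULL,
    NextAction  varchar(20) NOT NULL,
    Frequency   int         NOT NULL
);

-- Start with the smallest GroupId (NULL if the table is empty).
SELECT @g = MIN(GroupId) FROM Activities;

WHILE @g IS NOT NULL
BEGIN
    -- Same pair count as above, restricted to one group.
    INSERT INTO #PairCounts (FirstAction, NextAction, Frequency)
    SELECT a1.[Action], a2.[Action], COUNT(*)
    FROM Activities a1
    JOIN Activities a2
      ON a1.GroupId = a2.GroupId
     AND a2.Serial = a1.Serial + @k
    WHERE a1.GroupId = @g
    GROUP BY a1.[Action], a2.[Action];

    -- Move to the next group; MIN returns NULL when none is left, which ends the loop.
    SELECT @g = MIN(GroupId) FROM Activities WHERE GroupId > @g;
END;

-- Add up the per-group counts into the final frequencies.
SELECT FirstAction, NextAction, SUM(Frequency) AS Frequency
FROM #PairCounts
GROUP BY FirstAction, NextAction;

DROP TABLE #PairCounts;
```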

By the way, the table does have an index on (GroupId, Serial), right?
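
If it does not, something like the following would create one. This is only a sketch: the index name is made up, the table name Activities is taken from the query above, and including Action in the index assumes the column is small enough that carrying it there is acceptable.

```sql
-- Hypothetical index name; table and column names follow the query above.
-- The key (GroupId, Serial) matches the join predicate;
-- INCLUDE ([Action]) lets the query read Action from the index
-- without going back to the base table.
CREATE INDEX IX_Activities_GroupId_Serial
    ON Activities (GroupId, Serial)
    INCLUDE ([Action]);
```

If the table already has a clustered primary key on (GroupId, Serial), as in the create table statement from Answer 0, that key already serves this purpose and a separate index is probably unnecessary.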