SQL select to remove adjacent duplicate rows and perform a time calculation

Time: 2014-02-20 23:05:56

Tags: sql sql-server sql-server-2012

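See my solution after the ###### below.

MS SQL Server 2012. I need to remove adjacent duplicate rows in the Flow column below, keeping the first row of each run (marked with * for illustration). Then, over the remaining rows, I need the time difference between each 1 and the 0 that follows it, summed to give a cumulative time.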

Record Number    Downhole Time      Flow
-------------------------------------------
0        03/27/2013 19:23:48.582    1       *
58       03/27/2013 19:28:12.606    1   
137      03/27/2013 19:32:16.070    0       *
143      03/27/2013 19:33:59.070    0   
255      03/27/2013 19:40:14.070    0   
272      03/29/2013 14:43:55.071    1       *
289      03/29/2013 14:45:44.070    1   
293      03/29/2013 14:45:59.071    0       *
294      03/29/2013 14:46:10.070    0   

Result after removing the adjacent duplicates:

Record Number    Downhole Time      Flow
-------------------------------------------
0        03/27/2013 19:23:48.582    1       *
137      03/27/2013 19:32:16.070    0       *
272      03/29/2013 14:43:55.071    1       *
293      03/29/2013 14:45:59.071    0       *

Final expected result: cumulative time difference = (03/27/2013 19:32:16.070 - 03/27/2013 19:23:48.582) + (03/29/2013 14:45:59.071 - 03/29/2013 14:43:55.071) + further pairs if there are more rows.
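The expected total can be checked by hand outside the database. The following is a minimal Python sketch, not part of the original question: it assumes the nine sample rows are already sorted by Downhole Time, and the names `rows` and `flow_on_seconds` are made up for this illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Sample rows from the question: (Downhole Time, Flow), sorted by time.
rows = [
    ("03/27/2013 19:23:48.582", 1),
    ("03/27/2013 19:28:12.606", 1),
    ("03/27/2013 19:32:16.070", 0),
    ("03/27/2013 19:33:59.070", 0),
    ("03/27/2013 19:40:14.070", 0),
    ("03/29/2013 14:43:55.071", 1),
    ("03/29/2013 14:45:44.070", 1),
    ("03/29/2013 14:45:59.071", 0),
    ("03/29/2013 14:46:10.070", 0),
]


def flow_on_seconds(rows):
    """Sum the time from the start of each Flow = 1 run to the start of the next run."""
    parsed = [(datetime.strptime(t, "%m/%d/%Y %H:%M:%S.%f"), f) for t, f in rows]
    # Keep only the first row of every run of equal Flow values.
    firsts = [next(group) for _, group in groupby(parsed, key=lambda r: r[1])]
    total = timedelta()
    # Pair each kept row with the kept row that follows it.
    for (start, flow), (end, _) in zip(firsts, firsts[1:]):
        if flow == 1:
            total += end - start
    return total.total_seconds()


print(flow_on_seconds(rows))
```

For the sample data the two intervals are 507.488 s and 124.000 s, so the total is 631.488 s at full millisecond precision.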

Solution ####### The following looks much better in a SQL editor; just paste it in.

WITH FlowEvntTable AS (
   /* the following gets raw data and adds Row# for the next select to use*/
   Select
      ROW_NUMBER() OVER (ORDER BY [Downhole Time]) AS RNum,
      [Downhole Time],
      [Record Number],
      Value As Flow
   FROM [newMDF].[dbo].[vLog]
   where
      [Event Name] like 'Flow%'
      AND [Field Name] like 'Flow'
), 
NoDuplicatesFlowTable AS (
   /* the following came from StackOverflow "ignore adjacent matching rows":
      drop every row whose previous row (RNum - 1) has the same Flow */
   Select [Downhole Time], [Flow]
   FROM FlowEvntTable A
   where A.RNum NOT IN (
      SELECT Cur.RNum
      FROM FlowEvntTable Cur
      JOIN FlowEvntTable Prev
         ON Prev.RNum + 1 = Cur.RNum AND Prev.Flow = Cur.Flow
   )
), 
FlowOffColAddedTable AS (
   /* add the next row's time as a new column */
   Select *, lead([Downhole Time]) OVER (ORDER BY [Downhole Time]) AS NotFlowTime
   FROM NoDuplicatesFlowTable 
), 
FlowStartEndTimeTable AS (
   /* the select above adds the time offset by 1 row as a new column;
      filtering on Flow = 1 (On) now gives Start/End pairs for each On period */
   Select [Downhole Time] AS StartTime, NotFlowTime AS EndTime
   FROM FlowOffColAddedTable
   where Flow = 1
)

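/* diff and sum the pairs; /1000 is integer division, so fractional seconds are truncated */
Select Sum(DATEDIFF(ms,StartTime,EndTime))/1000 AS VibeOnSec From 
 FlowStartEndTimeTable 

Intermediate results after the "Select *, lead ..." step above. Yes, they do not match the data above; they are only meant to give a rough idea.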

Downhole Time         Flow  NotFlowTime  
-------------------------------------------  
2013-03-28 00:23:48.0000000 1   2013-03-28 00:32:16.0000000  
2013-03-28 00:32:16.0000000 0   2013-03-28 00:33:59.0000000  
2013-03-28 00:33:59.0000000 1   2013-03-28 00:40:14.0000000  
2013-03-28 00:40:14.0000000 0   2013-03-29 19:43:55.0000000  
2013-03-29 19:43:55.0000000 1   2013-03-29 19:45:44.0000000  

3 Answers:

Answer 0 (score: 0)

Not sure which database you are using. Here is a solution using analytic functions in Oracle:

SELECT 
  un, 
  mytime, 
  flow,
  lead (mytime) OVER (ORDER BY UN) lead_time,
 (lead (mytime) OVER (ORDER BY UN) - mytime)*24*60 minutes
  FROM (  SELECT un,
                 mytime,
                 flow,
                 LAG (flow) OVER (ORDER BY UN) lag_val
            FROM test
        ORDER BY un) a
 WHERE a.flow != NVL (a.lag_val, 9999)

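The inner select uses the LAG analytic function to fetch the previous row's flow value. The outer select's WHERE clause filters out the "duplicate" flows, leaving only the first event of each change. The outer select also uses the LEAD analytic function to compute the time difference in minutes; because LEAD is evaluated after the WHERE filter, it sees the next remaining row rather than the next raw row. This should perform well even with a large amount of data. Let me know which database you are using; most databases have an analytic function implementation (or a workaround). This query as written only works on Oracle.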

Answer 1 (score: 0)

I believe this does what you are asking for:

WITH FlowIntervals AS (
   SELECT
      FromTime = Min(D.[Downhole Time]),
      X.ToTime
   FROM
      dbo.vLog D
      OUTER APPLY (
         SELECT TOP 1 ToTime = D2.[Downhole Time]
         FROM dbo.vLog D2
         WHERE
            D.[Downhole Time] < D2.[Downhole Time]
            AND D.[Flow] <> D2.[Flow]
         ORDER BY D2.[Downhole Time]
      ) X
   WHERE D.Flow = 1
   GROUP BY X.ToTime
)
SELECT Sum(DateDiff(ms, FromTime, IsNull(ToTime, GetDate())) / 1000.0)
FROM FlowIntervals
;

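This query works on SQL 2005 and later. It should perform reasonably well, but it requires a self-join of the vLog table, so it may not perform as well as a solution using LEAD or LAG.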
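To make the OUTER APPLY logic concrete, here is a rough Python sketch of the same steps. It is an illustration under assumptions, not a translation the answer's author provided: the function name `flow_intervals` and the six sample rows are made up, and `datetime.now()` stands in for GetDate().

```python
from datetime import datetime, timedelta


def flow_intervals(rows):
    """Map each ToTime to the earliest Flow = 1 FromTime that leads to it.

    rows: list of (time, flow) tuples.
    """
    intervals = {}
    for t, flow in rows:
        if flow != 1:  # WHERE D.Flow = 1
            continue
        # OUTER APPLY (SELECT TOP 1 ... ORDER BY): earliest later row whose Flow differs.
        later = [t2 for t2, f2 in rows if t2 > t and f2 != flow]
        to_time = min(later) if later else None
        # GROUP BY X.ToTime with Min(D.[Downhole Time]).
        if to_time not in intervals or t < intervals[to_time]:
            intervals[to_time] = t
    return intervals


# Made-up data: seconds after a base time, with a Flow value for each row.
base = datetime(2013, 3, 27, 19, 0, 0)
offsets = [(0, 1), (10, 1), (30, 0), (40, 0), (100, 1), (130, 0)]
rows = [(base + timedelta(seconds=s), f) for s, f in offsets]

total = sum(
    # IsNull(ToTime, GetDate()): an open-ended interval runs until now.
    ((to if to is not None else datetime.now()) - frm).total_seconds()
    for to, frm in flow_intervals(rows).items()
)
print(total)  # 60.0
```

The inner scan over all rows for every Flow = 1 row mirrors the self-join cost mentioned above: without a suitable index, the work grows roughly with the square of the row count.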

If you are looking for the absolute best performance, this query might do the trick:

WITH Ranks AS (
   SELECT
      Grp =
         Row_Number() OVER (ORDER BY [Downhole Time])
         - Row_Number() OVER (PARTITION BY Flow ORDER BY [Downhole Time]),
      [Downhole Time],
      Flow
   FROM dbo.vLog
), Ranges AS (
   SELECT
      Result = Row_Number() OVER (ORDER BY Min(R.[Downhole Time]), X.Num) / 2,
      [Downhole Time] = Min(R.[Downhole Time]),
      R.Flow, X.Num
   FROM
      Ranks R
      CROSS JOIN (SELECT 1 UNION ALL SELECT 2) X (Num)
   GROUP BY
      R.Flow, R.Grp, X.Num
), FlowStates AS (
   SELECT
      FromTime = Min([Downhole Time]),
      ToTime = CASE WHEN Count(*) = 1 THEN NULL ELSE Max([Downhole Time]) END,
      Flow = IsNull(Min(CASE WHEN Num = 2 THEN Flow ELSE NULL END), Min(Flow))
   FROM Ranges R
   WHERE Result > 0
   GROUP BY Result
)
SELECT
   ElapsedSeconds =
      Sum(DateDiff(ms, FromTime, IsNull(ToTime, GetDate())) / 1000.0)
FROM
   FlowStates
WHERE
   Flow = 1
;

With your sample data, it returns 631.486000 (seconds). If you select just the rows from the FlowStates CTE, you get the following:

FromTime                ToTime                  Flow
----------------------- ----------------------- ----
2013-03-27 19:23:48.583 2013-03-27 19:32:16.070 1
2013-03-27 19:32:16.070 2013-03-29 14:43:55.070 0
2013-03-29 14:43:55.070 2013-03-29 14:45:59.070 1
2013-03-29 14:45:59.070 NULL                    0

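This query runs on SQL 2005 and later, and should stack up very well in performance against any other solution, including those using LEAD or LAG (which it sneakily emulates). I don't promise it will win, but it should do well and may just win after all.

For details on what the query does, see this answer to a similar question.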
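The core of the Ranks CTE is the difference between two row numbers, which stays constant within a run of equal Flow values. A small Python sketch of just that step, with the function name `island_groups` chosen here for illustration:

```python
from collections import defaultdict


def island_groups(flows):
    """Return the Grp value per row, as computed in the Ranks CTE.

    flows: Flow values already ordered by Downhole Time.
    """
    per_flow_count = defaultdict(int)  # Row_Number() OVER (PARTITION BY Flow ...)
    groups = []
    for overall, flow in enumerate(flows, start=1):  # Row_Number() OVER (ORDER BY ...)
        per_flow_count[flow] += 1
        groups.append(overall - per_flow_count[flow])
    return groups


flows = [1, 1, 0, 0, 0, 1, 1, 0, 0]  # Flow column of the sample data
print(list(zip(flows, island_groups(flows))))
# [(1, 0), (1, 0), (0, 2), (0, 2), (0, 2), (1, 3), (1, 3), (0, 4), (0, 4)]
```

Within one run both row numbers advance together, so their difference does not change; when the other Flow value interrupts, only the overall row number advances and the difference jumps. Grp on its own is not guaranteed to be unique across different Flow values, which is why the query groups by both R.Flow and R.Grp.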

Finally, for completeness, here is the Lag/Lead version for SQL Server:

WITH StateChanges AS (
   SELECT
      [Downhole Time],
      Flow,
      Lag(Flow) OVER (ORDER BY [Downhole Time]) PrevFlow
   FROM
      dbo.vLog
), Durations AS (
   SELECT
      [Downhole Time], 
      Lead([Downhole Time]) OVER (ORDER BY [Downhole Time]) NextTime,
      Flow
   FROM
      StateChanges
   WHERE
      Flow <> PrevFlow
      OR PrevFlow IS NULL
)
SELECT ElapsedTime = Sum(DateDiff(ms, [Downhole Time], NextTime) / 1000.0)
FROM Durations
WHERE Flow = 1
;

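This query requires SQL Server 2012 or later. It first attaches the previous Flow value to every row, then keeps only the rows where the flow actually changed (plus the very first row), and finally sums the durations of the rows where the flow went from 0 to 1 (flow started).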

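I would be curious to see actual performance results, both I/O and time, for this query against the others. If you only look at the execution plans, this query appears to win, but it may not be such a clear winner in real performance statistics.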
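The Lag/Lead logic can be tried without a SQL Server instance. The sketch below swaps in SQLite through Python's `sqlite3` module, which is an assumption of this example and not what the answer used: it needs SQLite 3.25 or later for window functions, it replaces DateDiff with `julianday` arithmetic, and the table `vlog` with columns `t` and `flow` is a hypothetical stand-in for dbo.vLog.

```python
import sqlite3

# Hypothetical stand-in for dbo.vLog, loaded with the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vlog (t TEXT, flow INTEGER)")
conn.executemany(
    "INSERT INTO vlog VALUES (?, ?)",
    [
        ("2013-03-27 19:23:48.582", 1),
        ("2013-03-27 19:28:12.606", 1),
        ("2013-03-27 19:32:16.070", 0),
        ("2013-03-27 19:33:59.070", 0),
        ("2013-03-27 19:40:14.070", 0),
        ("2013-03-29 14:43:55.071", 1),
        ("2013-03-29 14:45:44.070", 1),
        ("2013-03-29 14:45:59.071", 0),
        ("2013-03-29 14:46:10.070", 0),
    ],
)

query = """
WITH StateChanges AS (
    -- Attach the previous row's flow to every row.
    SELECT t, flow, LAG(flow) OVER (ORDER BY t) AS prev_flow
    FROM vlog
), Durations AS (
    -- Keep only rows where the flow changed; LEAD then sees the next kept row.
    SELECT t, LEAD(t) OVER (ORDER BY t) AS next_t, flow
    FROM StateChanges
    WHERE flow <> prev_flow OR prev_flow IS NULL
)
-- julianday() differences are in days; convert to seconds.
SELECT SUM((julianday(next_t) - julianday(t)) * 86400.0)
FROM Durations
WHERE flow = 1
"""
elapsed = conn.execute(query).fetchone()[0]
print(round(elapsed, 3))
```

The result should come out at roughly 631.488 seconds. It differs slightly from the 631.486 reported above because SQL Server's datetime type rounds to 1/300 of a second, while this sketch keeps the millisecond values as given.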

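Answer 2 (score: 0)

I posted my answer after my question (see the solution after the ###### above). Thanks, everyone, for the tidbits.

PS I was still figuring out the Stack Overflow editor/system, so my answer ended up in the same place as the question, some time after the question was posted. Sorry.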