分组连续范围

时间:2016-09-21 21:21:53

标签: sql r range data.table plyr

我有一个包含很多行的数据表,我想有条件地将两列分组,即Begin和End。这些列代表相关人员正在做某事的某个月。这里有一些示例数据(如果你不使用R,可以使用R来读入,或者找到下面的纯表):

# base:
test <- read.table(
text = "
1   A   mnb USA prim    4   12
2   A   mnb USA x   13  15
3   A   mnb USA un  16  25
4   A   mnb USA fdfds   1   2
5   B   ghf CAN sdg 3   27
6   B   ghf CAN hgh 28  29
7   B   ghf CAN y   24  31
8   B   ghf CAN ghf 38  42
",header=F)
library(data.table)
setDT(test)
names(test) <-  c("row","Person","Name","Country","add info","Begin","End")
out <- read.table(
text = "
1   A   mnb USA fdfds   1   2
2   A   mnb USA -   4   25
3   B   ghf CAN -   3   31
4   B   ghf CAN ghf 38  42
",header=F)
setDT(out)
names(out) <- c("row","Person","Name","Country","add info","Begin","End")

分组应按如下方式进行:如果A人从第4个月到第15个月进行了徒步旅行并且从第16个月到第24个月进行了远足,我会将连续(即没有休息)的活动从第4个月分组到第24个月。事后A人从第25个月到第28个月进行冲浪,我也会加上这个,整个团体活动将持续4到28个。 现在有问题的是有重叠时期的情况,例如人A也可能从11到31捕鱼,所以整个事情将变成4到31.但是,如果人A做了1到2的事情,那将是一个单独的活动(与1比3相比,也必须加上,因为3连接到4)。我希望这很清楚,如果没有,你可以在上面的代码中找到更多的例子。 我正在使用数据表,因为我的数据集非常大。到目前为止,我已经开始使用sqldf了,但是如果每个人有这么多活动(比方说8个或更多),那就有问题了。 这可以在datatable,plyr或sqldf中完成吗? 请注意:我也在寻找SQL中的答案,因为我可以直接在sqldf中使用它或尝试将其转换为另一种语言。 sqldf支持(1)SQLite后端数据库(默认情况下),(2)H2 java数据库,(3)PostgreSQL数据库和(4)sqldf 0.4-0以后也支持MySQL。

编辑:这是'纯'表:

在:

Person Name Country add info  Begin End
A      mnb  USA     prim      4      12
A      mnb  USA     x         13     15
A      mnb  USA     un        16     25
A      mnb  USA     fdfds     1      2
B      ghf  CAN     sdg       3      27
B      ghf  CAN     hgh       28     29
B      ghf  CAN     y         24     31
B      ghf  CAN     ghf       38     42

输出:

A      mnb  USA     fdfds     1      2
A      mnb  USA     -         4      25
B      ghf  CAN     -         3      31
B      ghf  CAN     ghf       38     42

2 个答案:

答案 0 :(得分:2)

如果您使用的是SQL Server 2012或更高版本,则可以使用LAG和LEAD函数构建逻辑,以获得最终所需的数据集。我相信,自Oracle 8i以来,Oracle中也提供了这些功能。

以下是我在SQL Server 2012中创建的解决方案,它可以为您提供帮助。您提供的示例值将加载到临时表中,以表示您所称的第一个“纯表”。使用这两个函数以及OVER子句,我使用下面的T-SQL代码到达了您的最终数据集。我在代码中留下了一些注释掉的行,以显示我是如何逐个构建整体解决方案的,这解释了放置在作为分组标志的GapMarker列的CASE语句中的各种场景。

IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL
	DROP TABLE #MyTable

CREATE TABLE #MyTable (
	 Person CHAR(1)
	,[Name] VARCHAR(3)
	,Country VARCHAR(10)
	,add_info VARCHAR(10)
	,[Begin] INT
	,[End] INT
)

INSERT INTO #MyTable (Person, Name, Country, add_info, [Begin], [End])
	VALUES ('A', 'mnb', 'USA', 'prim', 4, 12),
	('A', 'mnb', 'USA', 'x', 13, 15),
	('A', 'mnb', 'USA', 'un', 16, 25),
	('A', 'mnb', 'USA', 'fdfds', 1, 2),
	('B', 'ghf', 'CAN', 'sdg', 3, 27),
	('B', 'ghf', 'CAN', 'hgh', 28, 29),
	('B', 'ghf', 'CAN', 'y', 24, 31),
	('B', 'ghf', 'CAN', 'ghf', 38, 42);

WITH CTE
AS
(SELECT
		mt.Person
		,mt.Name
		,mt.Country
		,mt.add_info
		,mt.[Begin]
		,mt.[End]
		--,LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
		--,CASE WHEN [End] + 1 = LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
		--      --AND LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]) = LEAD([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
		--	  THEN 1
		--	  ELSE 0
		--  END AS Grp
		--,MARKER = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End]))
		,CASE
			WHEN mt.[End] + 1 = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) OR
				1 + COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) = mt.[Begin] OR
				COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])) BETWEEN mt.[Begin] AND mt.[End] OR
				[End] BETWEEN LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) AND LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) THEN 1
			ELSE 0
		END AS GapMarker
		,InBetween = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]))
		,EndInBtw = LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
		,LagEndInBtw = LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
	FROM #MyTable mt
--ORDER BY mt.Person, mt.[Begin]
)
SELECT DISTINCT
	X.Person
	,X.[Name]
	,X.Country
	,t.add_info
	,X.MinBegin
	,X.MaxEnd
FROM (SELECT
		c.Person
		,c.[Name]
		,c.Country
		,c.add_info
		,c.[Begin]
		,c.[End]
		,c.GapMarker
		,c.InBetween
		,c.EndInBtw
		,c.LagEndInBtw
		,MIN(c.[Begin]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MinBegin
		,MAX(c.[End]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MaxEnd
	--, CASE WHEN c.[End]+1 = c.MARKER
	--        OR c.MARKER +1 = c.[Begin] 
	--  THEN 1
	--  ELSE 0
	--  END Grp
	FROM CTE AS c) X
LEFT JOIN #MyTable AS t
	ON t.[Begin] = X.[MinBegin]
		AND t.[End] = X.[MaxEnd]
		AND t.Person = X.Person
ORDER BY X.Person, X.MinBegin
--ORDER BY Person, [Begin]

以下是与您所需的最终数据集匹配的结果的屏幕截图:

enter image description here

答案 1 :(得分:2)

我做了一个在我的测试中工作的那个,几乎所有的主要数据库都应该正常运行...我强调了我的列...请在测试前更改名称:

SELECT 
  r1.person_,
  r1.name_,
  r1.country_,
  CASE  
    WHEN max(r2.begin_) = max(r1.begin_) 
    THEN max(r1.info_) ELSE '-' 
  END info_,
  MAX(r2.begin_) begin_, 
  r1.end_ 
FROM stack_39626781 r1
INNER JOIN stack_39626781 r2 ON 1=1
  AND r2.person_ = r1.person_
  AND r2.begin_ <= r1.begin_ -- just optimizing...
LEFT JOIN stack_39626781 r3 ON 1=1
  AND r3.person_ = r1.person_
  -- matches when another range overlaps this range end
  AND r3.end_   >= r1.end_ + 1
  AND r3.begin_ <= r1.end_ + 1
LEFT JOIN stack_39626781 r4 ON 1=1
  AND r4.person_ = r2.person_
  -- matches when another range overlaps this range begin
  AND r4.end_   >= r2.begin_ - 1
  AND r4.begin_ <= r2.begin_ - 1
WHERE 1=1
  -- get rows 
  -- with no overlaps on end range and
  -- with no overlaps on begin range
  AND r3.person_ IS NULL
  AND r4.person_ IS NULL
GROUP BY
  r1.person_,
  r1.name_,
  r1.country_,
  r1.end_ 

此查询基于以下事实:输出中的任何范围都没有连接/重叠。可以说,对于五个范围的输出,存在五个begin和五个end,没有连接/重叠。查找并关联它们应该比生成所有连接/重叠更容易。那么,这个查询的作用是:

  1. 查找每个人的所有范围,其end值没有重叠/连接;
  2. 查找每个人的所有范围,其begin值没有重叠/连接;
  3. 这些是有效范围,因此将它们全部关联以找到正确的对;
  4. 对于每个personend,正确的begin对是可用的最大值,其值等于或小于此end ...它&#39;很容易验证此规则...首先,如果您有两个或更多可能的begin,则不能end大于begin ...小于end,e。 g。, begin1 = 结束 - 2 begin2 = 结束 - 5 ,选择较小的一个( begin2 )使较大的一个( begin1 )成为此范围的重叠。
  5. 希望它有所帮助。