我有一个包含很多行的数据表,我想有条件地将两列分组,即Begin和End。这些列代表相关人员正在做某事的某个月。这里有一些示例数据(如果你不使用R,可以使用R来读入,或者找到下面的纯表):
# base:
test <- read.table(
text = "
1 A mnb USA prim 4 12
2 A mnb USA x 13 15
3 A mnb USA un 16 25
4 A mnb USA fdfds 1 2
5 B ghf CAN sdg 3 27
6 B ghf CAN hgh 28 29
7 B ghf CAN y 24 31
8 B ghf CAN ghf 38 42
",header=F)
library(data.table)
setDT(test)
names(test) <- c("row","Person","Name","Country","add info","Begin","End")
out <- read.table(
text = "
1 A mnb USA fdfds 1 2
2 A mnb USA - 4 25
3 B ghf CAN - 3 31
4 B ghf CAN ghf 38 42
",header=F)
setDT(out)
names(out) <- c("row","Person","Name","Country","add info","Begin","End")
分组应按如下方式进行:如果A人从第4个月到第15个月进行了徒步旅行并且从第16个月到第24个月进行了远足,我会将连续(即没有休息)的活动从第4个月分组到第24个月。事后A人从第25个月到第28个月进行冲浪,我也会加上这个,整个团体活动将持续4到28个。 现在有问题的是有重叠时期的情况,例如人A也可能从11到31捕鱼,所以整个事情将变成4到31.但是,如果人A做了1到2的事情,那将是一个单独的活动(与1比3相比,也必须加上,因为3连接到4)。我希望这很清楚,如果没有,你可以在上面的代码中找到更多的例子。 我正在使用数据表,因为我的数据集非常大。到目前为止,我已经开始使用sqldf了,但是如果每个人有这么多活动(比方说8个或更多),那就有问题了。 这可以在datatable,plyr或sqldf中完成吗? 请注意:我也在寻找SQL中的答案,因为我可以直接在sqldf中使用它或尝试将其转换为另一种语言。 sqldf支持(1)SQLite后端数据库(默认情况下),(2)H2 java数据库,(3)PostgreSQL数据库和(4)sqldf 0.4-0以后也支持MySQL。
编辑:这是'纯'表:
在:
Person Name Country add info Begin End
A mnb USA prim 4 12
A mnb USA x 13 15
A mnb USA un 16 25
A mnb USA fdfds 1 2
B ghf CAN sdg 3 27
B ghf CAN hgh 28 29
B ghf CAN y 24 31
B ghf CAN ghf 38 42
输出:
A mnb USA fdfds 1 2
A mnb USA - 4 25
B ghf CAN - 3 31
B ghf CAN ghf 38 42
答案 0 :(得分:2)
如果您使用的是SQL Server 2012或更高版本,则可以使用LAG和LEAD函数构建逻辑,以获得最终所需的数据集。我相信,自Oracle 8i以来,Oracle中也提供了这些功能。
以下是我在SQL Server 2012中创建的解决方案,它可以为您提供帮助。您提供的示例值将加载到临时表中,以表示您所称的第一个“纯表”。使用这两个函数以及OVER子句,我使用下面的T-SQL代码到达了您的最终数据集。我在代码中留下了一些注释掉的行,以显示我是如何逐个构建整体解决方案的,这解释了放置在作为分组标志的GapMarker列的CASE语句中的各种场景。
IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL
DROP TABLE #MyTable
CREATE TABLE #MyTable (
Person CHAR(1)
,[Name] VARCHAR(3)
,Country VARCHAR(10)
,add_info VARCHAR(10)
,[Begin] INT
,[End] INT
)
INSERT INTO #MyTable (Person, Name, Country, add_info, [Begin], [End])
VALUES ('A', 'mnb', 'USA', 'prim', 4, 12),
('A', 'mnb', 'USA', 'x', 13, 15),
('A', 'mnb', 'USA', 'un', 16, 25),
('A', 'mnb', 'USA', 'fdfds', 1, 2),
('B', 'ghf', 'CAN', 'sdg', 3, 27),
('B', 'ghf', 'CAN', 'hgh', 28, 29),
('B', 'ghf', 'CAN', 'y', 24, 31),
('B', 'ghf', 'CAN', 'ghf', 38, 42);
WITH CTE
AS
(SELECT
mt.Person
,mt.Name
,mt.Country
,mt.add_info
,mt.[Begin]
,mt.[End]
--,LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
--,CASE WHEN [End] + 1 = LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
-- --AND LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]) = LEAD([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
-- THEN 1
-- ELSE 0
-- END AS Grp
--,MARKER = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End]))
,CASE
WHEN mt.[End] + 1 = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) OR
1 + COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) = mt.[Begin] OR
COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])) BETWEEN mt.[Begin] AND mt.[End] OR
[End] BETWEEN LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) AND LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) THEN 1
ELSE 0
END AS GapMarker
,InBetween = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]))
,EndInBtw = LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
,LagEndInBtw = LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
FROM #MyTable mt
--ORDER BY mt.Person, mt.[Begin]
)
SELECT DISTINCT
X.Person
,X.[Name]
,X.Country
,t.add_info
,X.MinBegin
,X.MaxEnd
FROM (SELECT
c.Person
,c.[Name]
,c.Country
,c.add_info
,c.[Begin]
,c.[End]
,c.GapMarker
,c.InBetween
,c.EndInBtw
,c.LagEndInBtw
,MIN(c.[Begin]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MinBegin
,MAX(c.[End]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MaxEnd
--, CASE WHEN c.[End]+1 = c.MARKER
-- OR c.MARKER +1 = c.[Begin]
-- THEN 1
-- ELSE 0
-- END Grp
FROM CTE AS c) X
LEFT JOIN #MyTable AS t
ON t.[Begin] = X.[MinBegin]
AND t.[End] = X.[MaxEnd]
AND t.Person = X.Person
ORDER BY X.Person, X.MinBegin
--ORDER BY Person, [Begin]
以下是与您所需的最终数据集匹配的结果的屏幕截图:
答案 1 :(得分:2)
我做了一个在我的测试中工作的那个,几乎所有的主要数据库都应该正常运行...我强调了我的列...请在测试前更改名称:
SELECT
r1.person_,
r1.name_,
r1.country_,
CASE
WHEN max(r2.begin_) = max(r1.begin_)
THEN max(r1.info_) ELSE '-'
END info_,
MAX(r2.begin_) begin_,
r1.end_
FROM stack_39626781 r1
INNER JOIN stack_39626781 r2 ON 1=1
AND r2.person_ = r1.person_
AND r2.begin_ <= r1.begin_ -- just optimizing...
LEFT JOIN stack_39626781 r3 ON 1=1
AND r3.person_ = r1.person_
-- matches when another range overlaps this range end
AND r3.end_ >= r1.end_ + 1
AND r3.begin_ <= r1.end_ + 1
LEFT JOIN stack_39626781 r4 ON 1=1
AND r4.person_ = r2.person_
-- matches when another range overlaps this range begin
AND r4.end_ >= r2.begin_ - 1
AND r4.begin_ <= r2.begin_ - 1
WHERE 1=1
-- get rows
-- with no overlaps on end range and
-- with no overlaps on begin range
AND r3.person_ IS NULL
AND r4.person_ IS NULL
GROUP BY
r1.person_,
r1.name_,
r1.country_,
r1.end_
此查询基于以下事实:输出中的任何范围都没有连接/重叠。可以说,对于五个范围的输出,存在五个begin
和五个end
,没有连接/重叠。查找并关联它们应该比生成所有连接/重叠更容易。那么,这个查询的作用是:
end
值没有重叠/连接; begin
值没有重叠/连接; person
和end
,正确的begin
对是可用的最大值,其值等于或小于此end
...它&#39;很容易验证此规则...首先,如果您有两个或更多可能的begin
,则不能end
大于begin
...小于end
,e。 g。, begin1 = 结束 - 2 和 begin2 = 结束 - 5 ,选择较小的一个( begin2 )使较大的一个( begin1 )成为此范围的重叠。 希望它有所帮助。