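Edit: I am now also looking for solutions in other programming languages.
Following on from the other question I asked, I have a dataset like this (for R users, the dput output is below) that represents user computer sessions: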
username machine start end
1 user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
2 user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16
3 user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4 user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5 user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6 user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7 user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8 user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9 user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10 USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11 USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12 USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13 USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14 USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
15 USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47
>
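There may be several concurrent (time-overlapping) sessions for the same username on the same machine. How can I remove those rows so that only one session is left for each overlap? The original dataset has approximately 500,000 rows.
The expected output is (rows 2 and 15 removed):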
username machine start end
1 user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
3 user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4 user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5 user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6 user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7 user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8 user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9 user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10 USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11 USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12 USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13 USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14 USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
>
Here is the dataset:
structure(list(username = c("user1", "user1", "user1",
"user1", "user1", "user1", "user1", "user1",
"user1", "USER2", "USER2", "USER2", "USER2", "USER2", "USER2"
), machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("D5599.domain.com", "D5645.domain.com",
"D5682.domain.com", "D5686.domain.com", "D5694.domain.com", "D5696.domain.com",
"D5772.domain.com", "D5772.domain.com", "D5847.domain.com", "D5855.domain.com",
"D5871.domain.com", "D5927.domain.com", "D5927.domain.com", "D5952.domain.com",
"D5993.domain.com", "D6012.domain.com", "D6048.domain.com", "D6077.domain.com",
"D5688.domain.com", "D5815.domain.com", "D6106.domain.com", "D6128.domain.com"
), class = "factor"), start = structure(c(1294040658, 1294040789,
1294056456, 1294232597, 1297686819, 1298462070, 1300695018, 1307603561,
1294049025, 1294835165, 1295269579, 1295356050, 1295961655, 1297688580,
1297688363), class = c("POSIXct", "POSIXt"), tzone = ""), end =
structure(c(1294040847,
1294042156, 1294059377, 1294233795, 1297687216, 1298462303, 1300696342,
1307606339, 1294050583, 1294835573, 1295271862, 1295358163, 1295961878,
1297688863, 1297689287), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("username",
"machine", "start", "end"), row.names = c(NA, 15L), class = "data.frame")
Answer 0 (score: 3)
Try the intervals package:
library(intervals)
f <- function(dd) with(dd, {
  r <- reduce(Intervals(cbind(start, end)))
  data.frame(username = username[1],
             machine = machine[1],
             start = structure(r[, 1], class = class(start)),
             end = structure(r[, 2], class = class(end)))
})
do.call("rbind", by(d, d[1:2], f))
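With the sample data, this reduces the 15 rows to the following 13 rows (by combining rows 1 and 2 and rows 14 and 15 of the original data frame). The times are shifted relative to the question, presumably because the timestamps in the dput output have tzone = "" and so print in the local time zone: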
username machine start end
1 user1 D5599.domain.com 2011-01-03 02:44:18 2011-01-03 03:09:16
2 user1 D5599.domain.com 2011-01-03 07:07:36 2011-01-03 07:56:17
3 user1 D5599.domain.com 2011-01-05 08:03:17 2011-01-05 08:23:15
4 user1 D5599.domain.com 2011-02-14 07:33:39 2011-02-14 07:40:16
5 user1 D5599.domain.com 2011-02-23 06:54:30 2011-02-23 06:58:23
6 user1 D5599.domain.com 2011-03-21 04:10:18 2011-03-21 04:32:22
7 user1 D5645.domain.com 2011-06-09 03:12:41 2011-06-09 03:58:59
8 user1 D5682.domain.com 2011-01-03 05:03:45 2011-01-03 05:29:43
9 USER2 D5682.domain.com 2011-01-12 07:26:05 2011-01-12 07:32:53
10 USER2 D5682.domain.com 2011-01-17 08:06:19 2011-01-17 08:44:22
11 USER2 D5682.domain.com 2011-01-18 08:07:30 2011-01-18 08:42:43
12 USER2 D5682.domain.com 2011-01-25 08:20:55 2011-01-25 08:24:38
13 USER2 D5682.domain.com 2011-02-14 07:59:23 2011-02-14 08:14:47
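Answer 1 (score: 1)
One solution is to first split the intervals so that they are sometimes equal but never partially overlap, and then remove the duplicates. The problem is that we are left with many small abutting intervals, and merging them is not straightforward.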
library(plyr)   # ddply() is provided by plyr
library(sqldf)
d$machine <- as.character( d$machine ) # Duplicated levels...
ddply( d, c("username", "machine"), function (u) {
  # For each username and machine,
  # compute all the possible non-overlapping intervals
  intervals <- sort(unique( c(u$start, u$end) ))
  intervals <- data.frame(
    start = intervals[-length(intervals)],
    end = intervals[-1]
  )
  # Only retain those actually in the data
  u <- sqldf( "
    SELECT DISTINCT u.username, u.machine,
           intervals.start, intervals.end
    FROM u, intervals
    WHERE u.start <= intervals.start
    AND intervals.end <= u.end
  " )
  # We have non-overlapping, but potentially abutting intervals:
  # ideally, we should merge them, but I do not see an easy
  # way to do so.
  u
} )
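Since the question also welcomes other languages, here is a minimal Python sketch of the same splitting idea for a single username/machine group. The function name `split_into_elementary` and the tuple layout are assumptions made for illustration, not part of the answer above.

```python
def split_into_elementary(sessions):
    """Split (start, end) pairs of ONE user/machine into elementary pieces.

    The pieces never partially overlap; only pieces covered by at least
    one session are kept, and each piece appears once.
    """
    # Every start or end time is a potential cut point.
    cuts = sorted({t for session in sessions for t in session})
    pieces = zip(cuts[:-1], cuts[1:])
    return [
        (lo, hi)
        for lo, hi in pieces
        if any(start <= lo and hi <= end for start, end in sessions)
    ]


# Two overlapping sessions and one separate session (times as plain numbers).
pieces = split_into_elementary([(1, 5), (3, 9), (20, 25)])
```

As in the R version, the pieces that touch end to end (here those between 1 and 9) still have to be merged in a later step.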
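Edit: Another, conceptually cleaner solution, which also fixes the problem of unmerged abutting intervals, is to count the number of open sessions for each user and machine: when the count stops being zero, the user has logged in (with one or more sessions); when it drops back to zero, the user has closed all of his/her sessions.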
ddply( d, c("username", "machine"), function (u) {
  a <- rbind(
    data.frame( time = min(u$start) - 1, sessions = 0 ),
    data.frame( time = u$start, sessions = 1 ),
    data.frame( time = u$end, sessions = -1 )
  )
  a <- a[ order(a$time), ]
  a$sessions <- cumsum(a$sessions)
  a$previous <- c( 0, a$sessions[ - nrow(a) ] )
  a <- a[ a$previous == 0 & a$sessions > 0 |
          a$previous > 0 & a$sessions == 0, ]
  a$previous_time <- a$time
  a$previous_time[-1] <- a$time[ -nrow(a) ]
  a <- a[ a$previous > 0 & a$sessions == 0, ]
  a <- data.frame(
    username = u$username[1],
    machine = u$machine[1],
    start = a$previous_time,
    end = a$time
  )
  a
} )
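The open-session count can be sketched in Python as follows, again for one username/machine group. `merge_by_open_count` is a hypothetical name, and the tie-breaking rule (opens before closes at the same instant, so abutting sessions join) is an assumption of this sketch.

```python
def merge_by_open_count(sessions):
    """Merge (start, end) pairs of one user/machine by counting open sessions."""
    # +1 when a session opens, -1 when it closes. The middle element makes
    # opens sort before closes at equal times, so abutting sessions merge.
    events = sorted(
        [(start, 0, +1) for start, _ in sessions]
        + [(end, 1, -1) for _, end in sessions]
    )
    merged, open_count, login_time = [], 0, None
    for time, _, delta in events:
        if open_count == 0 and delta == +1:
            login_time = time                  # count leaves zero: logged in
        open_count += delta
        if open_count == 0:
            merged.append((login_time, time))  # count back at zero: all closed
    return merged


merged = merge_by_open_count([(1, 5), (3, 9), (20, 25)])
```

The same function works on `datetime` values, because it relies only on comparison and sorting.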
Answer 2 (score: 1)
An alternative solution using the interval class from lubridate.
library(lubridate)
int <- with(d, interval(start, end))  # called new_interval() in older lubridate versions
Now we need a function to test for overlaps. See Determine Whether Two Date Ranges Overlap.
int_overlaps <- function(int1, int2)
{
  (int_start(int1) <= int_end(int2)) &
    (int_start(int2) <= int_end(int1))
}
Now call it on every pair of intervals.
index <- combn(seq_along(int), 2)
overlaps <- int_overlaps(int[index[1, ]], int[index[2, ]])
The rows that overlap:
int[index[1, overlaps]]
int[index[2, overlaps]]
The rows to remove are simply index[2, overlaps].
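For comparison, the same pairwise test can be sketched in Python. The names `overlaps` and `later_rows_to_drop` are hypothetical, and the sketch assumes the intervals passed in all belong to one username/machine group.

```python
from itertools import combinations


def overlaps(a, b):
    """True when closed intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def later_rows_to_drop(intervals):
    """Indices of rows that overlap an earlier row (the second of each pair)."""
    return sorted({
        j
        for i, j in combinations(range(len(intervals)), 2)
        if overlaps(intervals[i], intervals[j])
    })


to_drop = later_rows_to_drop([(1, 5), (4, 9), (20, 25), (22, 23)])
```

The number of pairs grows quadratically with the number of rows, so on a dataset of the size mentioned in the question this would likely only be practical when applied group by group.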
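Answer 3 (score: 1)
Pseudocode solution: O(n log n), or O(n) if the data is already known to be sorted correctly.
First, sort the data by user, by machine, and by start time (so that all rows for a given user on a given machine are grouped together, and the rows within each group are in ascending order of start time).
Initialize the "working interval" to null/nil/undef/etc.
For each row, in order:
Finally, if a working interval exists, output it.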
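The per-row steps are not spelled out above, so the following Python sketch is only one plausible reading of the pseudocode: extend the working interval while rows of the same user and machine overlap it, otherwise output it and start a new one. The function name `merge_sessions` and the tuple layout are assumptions.

```python
def merge_sessions(rows):
    """rows: iterable of (username, machine, start, end) tuples."""
    merged = []
    working = None  # [username, machine, start, end] being extended, or None
    # Sort by user, machine, then start time (the O(n log n) step).
    for user, machine, start, end in sorted(rows, key=lambda r: r[:3]):
        same_group = working is not None and working[:2] == [user, machine]
        if same_group and start <= working[3]:
            # Overlaps (or abuts) the working interval: extend its end.
            working[3] = max(working[3], end)
        else:
            # Different group, or a gap: output the old interval, start anew.
            if working is not None:
                merged.append(tuple(working))
            working = [user, machine, start, end]
    if working is not None:
        merged.append(tuple(working))  # the final "output it" step
    return merged


result = merge_sessions([
    ("u", "m", 20, 25), ("v", "m", 1, 2), ("u", "m", 1, 5),
    ("u", "n", 2, 3), ("u", "m", 4, 9),
])
```

If the input is already sorted, the `sorted` call can be dropped, leaving the single linear pass.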
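Answer 4 (score: 1)
Don't know if this is what you want, or whether it is any better than what you already have. This is a PowerShell solution that uses a hashtable whose keys are the combination of username and computer name. The values are hashtables holding the start and end times.
If the key (session) already exists, its end time is updated. If not, one is created and the start time and initial end time are set. Whenever it encounters a new session record in the log for that user/computer, it updates the end time for that session key.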
$ht = @{}
import-csv <logfile> |
  foreach{
    $key = $_.username + $_.computername
    if ($ht.ContainsKey($key)){$ht.$key.end = $_.end}
    else{$ht.add("$key",@{start=$_.start;end=$_.end})}
  }
When it is done, you will need to split the username and computer name apart again.
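A rough Python equivalent, assuming a CSV log with the column names used in the snippet above (username, computername, start, end), could key the dictionary on a tuple, so that nothing has to be split apart afterwards. `collapse_sessions` is a hypothetical name. The sketch mirrors the logic above, so every record for a given user/computer ends up in a single start/end span.

```python
import csv
import io


def collapse_sessions(csv_text):
    """Keep the first start and the last-seen end per (username, computername)."""
    sessions = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["username"], row["computername"])  # tuple key, no re-splitting
        if key in sessions:
            sessions[key]["end"] = row["end"]          # later record: update end
        else:
            sessions[key] = {"start": row["start"], "end": row["end"]}
    return sessions


log = (
    "username,computername,start,end\n"
    "user1,D5599,09:44:18,09:47:27\n"
    "user1,D5599,09:46:29,10:09:16\n"
    "USER2,D5682,14:26:05,14:32:53\n"
)
sessions = collapse_sessions(log)
```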