Preparing data for anomaly detection

Time: 2016-11-21 19:58:17

Tags: r time-series aggregate anomaly-detection

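I have a task that involves anomaly detection on time-series data. I already have the anomaly detection code, but I am trying to prepare the data for it. The data looks like this: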

timestampUtc

2016-08-01 14:38:01, 2016-08-01 14:38:06, 2016-08-01 14:38:12, 2016-08-01 14:38:18, 2016-08-01 14:38:22, 2016-08-01 14:38:27, 2016-08-01 14:38:27, 2016-08-01 14:38:30, 2016-08-01 14:38:37, 2016-08-01 14:38:38, 2016-08-01 14:38:38, 2016-08-01 14:38:46, 2016-08-01 14:39:03, 2016-08-01 14:39:03, 2016-08-01 14:39:10, 2016-08-01 14:39:12, 2016-08-01 14:39:15, 2016-08-01 14:39:16, 2016-08-01 14:39:20, 2016-08-01 14:39:28

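First, I want to set the seconds to zero in the timestampUtc column. Next, I want to create a count column holding the number of values that fall within each minute. For example, the output should look like this: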

timestampUtc         count
2016-08-01 14:38:00  12
2016-08-01 14:39:00  6
2016-08-01 14:40:00  8

3 answers:

Answer 0 (score: 1)

You can convert the strings to date-times with as.POSIXct(), using a format that ignores the seconds, and then summarize with table:

timestampUtc <- c('2016-08-01 14:38:01', '2016-08-01 14:38:06', '2016-08-01 14:38:12', '2016-08-01 14:38:18', '2016-08-01 14:38:22', '2016-08-01 14:38:27', '2016-08-01 14:38:27', '2016-08-01 14:38:30', '2016-08-01 14:38:37', '2016-08-01 14:38:38', '2016-08-01 14:38:38', '2016-08-01 14:38:46', '2016-08-01 14:39:03', '2016-08-01 14:39:03', '2016-08-01 14:39:10', '2016-08-01 14:39:12', '2016-08-01 14:39:15', '2016-08-01 14:39:16', '2016-08-01 14:39:20', '2016-08-01 14:39:28')
timestampUtc <- as.POSIXct(timestampUtc, format="%Y-%m-%d %H:%M", tz="UTC")
table(timestampUtc)
2016-08-01 14:38:00 2016-08-01 14:39:00 
                 12                   8 
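The question asks for a two-column result rather than a printed table. A minimal sketch of one way to get there from the table approach above, using a shortened sample of the question's data; the names `x`, `minute`, and `counts` are illustrative and not part of the original answer:

```r
# A few of the question's timestamps, shortened for brevity
x <- c("2016-08-01 14:38:01", "2016-08-01 14:38:06", "2016-08-01 14:39:03")

# The format stops at %M, so the seconds are not parsed and default to zero
minute <- as.POSIXct(x, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Tabulate per minute and reshape into a data frame with a "count" column
counts <- as.data.frame(table(timestampUtc = minute),
                        responseName = "count",
                        stringsAsFactors = FALSE)

# table() turns the times into character labels; convert them back to POSIXct
counts$timestampUtc <- as.POSIXct(counts$timestampUtc, tz = "UTC")
print(counts)
```

The last conversion matters only if later steps need real date-times rather than character labels.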

Answer 1 (score: 1)

Assuming your timestamps are already in POSIXt format and your timestamp data is stored in df:

df$count <- 1
df$timestamp <- format(df$timestamp, format = "%Y-%m-%d %H:%M")
df <- aggregate(count ~ timestamp, data = df, FUN = sum)
names(df) <- c("timestamp", "count")
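For reference, here is the same idea as a self-contained sketch. The sample data frame is made up from a few of the question's timestamps, and the final conversion back to POSIXct is an addition, on the assumption that downstream code wants date-times rather than the character strings that format() returns:

```r
# Hypothetical input: a data frame with a POSIXct column named "timestamp"
df <- data.frame(
  timestamp = as.POSIXct(c("2016-08-01 14:38:01",
                           "2016-08-01 14:38:06",
                           "2016-08-01 14:39:03"), tz = "UTC")
)

df$count <- 1                                  # each row counts as one event
df$timestamp <- format(df$timestamp, format = "%Y-%m-%d %H:%M")  # drop seconds
df <- aggregate(count ~ timestamp, data = df, FUN = sum)         # sum per minute

# format() produced character strings; parse them back into POSIXct
df$timestamp <- as.POSIXct(df$timestamp, format = "%Y-%m-%d %H:%M", tz = "UTC")
print(df)
```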

Answer 2 (score: 1)

The cut and seq methods for the POSIXt class both accept an interval specification for breaks (or by):

 timestampUtc <-scan(text="2016-08-01 14:38:01, 2016-08-01 14:38:06, 2016-08-01 14:38:12, 2016-08-01 14:38:18, 2016-08-01 14:38:22, 2016-08-01 14:38:27, 2016-08-01 14:38:27, 2016-08-01 14:38:30, 2016-08-01 14:38:37, 2016-08-01 14:38:38, 2016-08-01 14:38:38, 2016-08-01 14:38:46, 2016-08-01 14:39:03, 2016-08-01 14:39:03, 2016-08-01 14:39:10, 2016-08-01 14:39:12, 2016-08-01 14:39:15, 2016-08-01 14:39:16, 2016-08-01 14:39:20, 2016-08-01 14:39:28",
                      what="", sep=",")
#Read 20 items

table( cut( as.POSIXct(timestampUtc), breaks="min")  )
#------------
2016-08-01 14:38:00 2016-08-01 14:39:00 
                 12                   8 

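If you wanted 10- or 15-minute intervals, the breaks value could have been "10 min" or "15 min". So far, one of the other answers discards information at the input stage, which I think is a questionable practice, whereas code_is_entropy applies the shortened format string with format only at the stage where the values are passed on for counting.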
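A small sketch of the wider-interval variant, with made-up timestamps chosen so that one interval is empty; the names `x` and `bins` are illustrative. As far as I can tell, the intervals start at the minute of the earliest observation rather than at a round clock boundary, and empty intervals stay in the result with a count of zero:

```r
# Illustrative timestamps with a gap between the second and third value
x <- as.POSIXct(c("2016-08-01 14:38:01",
                  "2016-08-01 14:39:03",
                  "2016-08-01 15:02:10"), tz = "UTC")

# Bin into 10-minute intervals; the result is a factor, one level per interval
bins <- cut(x, breaks = "10 min")

# Count the observations per interval, including intervals with no data
table(bins)
```

Keeping the zero-count intervals can be useful for anomaly detection, since a minute with no events is itself a data point.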