Extracting dates from a data frame in R

Time: 2016-07-14 17:31:51

Tags: regex r datetime ggplot2

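I want to plot a time series showing the number of log entries per hour. As a first step, I tried to separate the date of each entry from the rest of the log line, so that I can count the entries per hour.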

I have the following log data frame:

[Fri Jun 1 15:56:37 1995] httpd: send aborted for disarray.demon.co.uk
[Fri Jun 1 16:29:29 1995] httpd: send aborted for ansc86024.usask.ca
[Fri Jun 1 16:31:42 1995] httpd: send aborted for 194.20.24.70
[Fri Jun 1 16:34:11 1995] httpd: send aborted for sw24-70.iol.it
[Fri Jun 1 16:41:02 1995] httpd: send aborted for educ026.usask.ca
[Fri Jun 1 16:41:13 1995] httpd: send aborted for educ026.usask.ca
[Fri Jun 1 16:41:13 1995] httpd: send aborted for sw24-70.iol.it
[Fri Jun 1 16:45:07 1995] httpd: send aborted for 128.233.18.38
[Fri Jun 1 17:26:50 1995] httpd: send aborted for pc117c.nwrel.org
[Fri Jun 1 17:46:53 1995] httpd: send aborted for geoff.usask.ca
[Fri Jun 2 17:57:09 1995] httpd: send aborted for piweba3y.prodigy.com
[Fri Jun 2 17:57:50 1995] httpd: send aborted for piweba3y.prodigy.com
[Fri Jun 2 18:10:15 1995] httpd: send aborted for 193.74.92.109
[Fri Jun 2 20:14:30 1995] httpd: send aborted for 128.233.13.41
[Fri Jun 2 20:15:59 1995] httpd: send aborted for peter.net4.io.org
[Fri Jun 2 21:11:54 1995] httpd: send aborted for ped374.usask.ca

I would like to get the number of log entries per hour:

[image: expected result, number of log entries per hour]

I tried to add a date column using the gsub function.

1 Answer:

Answer 0 (score: 1)

How about this? Start with the log lines as a character vector and load the packages used below:

# Data in form of a string vector
dat = c("[Fri Jun 1 15:56:37 1995] httpd: send aborted for disarray.demon.co.uk", 
        "[Fri Jun 1 16:29:29 1995] httpd: send aborted for ansc86024.usask.ca", 
        "[Fri Jun 1 16:31:42 1995] httpd: send aborted for 194.20.24.70", 
        "[Fri Jun 1 16:34:11 1995] httpd: send aborted for sw24-70.iol.it", 
        "[Fri Jun 1 16:41:02 1995] httpd: send aborted for educ026.usask.ca", 
        "[Fri Jun 1 16:41:13 1995] httpd: send aborted for educ026.usask.ca", 
        "[Fri Jun 1 16:41:13 1995] httpd: send aborted for sw24-70.iol.it", 
        "[Fri Jun 1 16:45:07 1995] httpd: send aborted for 128.233.18.38", 
        "[Fri Jun 1 17:26:50 1995] httpd: send aborted for pc117c.nwrel.org", 
        "[Fri Jun 1 17:46:53 1995] httpd: send aborted for geoff.usask.ca", 
        "[Fri Jun 2 17:57:09 1995] httpd: send aborted for piweba3y.prodigy.com", 
        "[Fri Jun 2 17:57:50 1995] httpd: send aborted for piweba3y.prodigy.com", 
        "[Fri Jun 2 18:10:15 1995] httpd: send aborted for 193.74.92.109", 
        "[Fri Jun 2 20:14:30 1995] httpd: send aborted for 128.233.13.41", 
        "[Fri Jun 2 20:15:59 1995] httpd: send aborted for peter.net4.io.org", 
        "[Fri Jun 2 21:11:54 1995] httpd: send aborted for ped374.usask.ca")

library(dplyr)
library(lubridate)

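Extract the date string from each log line. The pattern skips the first five characters (the opening bracket, the weekday, and a space) and captures everything up to the closing bracket: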

dat = data.frame(date.string = gsub(".{5}(.*)\\].*", "\\1", dat))
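Because `(.*)` is greedy, the pattern above captures up to the last `]` on a line, so a message that itself contains a bracket would be over-captured. A minimal sketch of an anchored, lazy variant follows; this pattern and the variable names `x` and `date_string` are assumptions for illustration, not part of the original answer.

```r
# One sample line taken from the log above
x <- "[Fri Jun 1 15:56:37 1995] httpd: send aborted for disarray.demon.co.uk"

# Anchored variant (an assumption, not the answer's own regex):
# "[" + three-letter weekday + space, then lazily capture up to the first "]"
date_string <- sub("^\\[\\w{3} (.*?)\\].*$", "\\1", x, perl = TRUE)
date_string
```

Lines that do not match the pattern are returned unchanged by `sub`, so it may be worth checking for those before parsing the dates.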

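Convert the date string to POSIXct date-time format:

dat$date = as.POSIXct(dat$date.string, format= "%b %e %H:%M:%S %Y")

Now aggregate by hour. We discard the minutes and seconds so that we can group by date to count by hour: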
datByHour = dat %>% 
  mutate(date = as.POSIXct(paste0(paste(year(date),month(date),day(date),sep="-"), 
                                  " ", 
                                  paste(hour(date),"00:00", sep=":")))) %>%
  group_by(date) %>%
  tally 

datByHour

Hourly counts:

                 date     n
1 1995-06-01 15:00:00     1
2 1995-06-01 16:00:00     7
3 1995-06-01 17:00:00     2
4 1995-06-02 17:00:00     2
5 1995-06-02 18:00:00     1
6 1995-06-02 20:00:00     2
7 1995-06-02 21:00:00     1
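The hourly truncation can probably be written more compactly with `floor_date` from lubridate, and the counts can then be plotted with ggplot2, which is what the question was ultimately after. The following is a sketch under assumptions rather than the answer's own method: the names `date_string`, `logs`, `by_hour`, and `p` are made up here, `tz = "UTC"` is a guess because the log does not state its time zone, and parsing `%b` requires a locale with English month abbreviations.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Date strings as extracted from the 16 log lines above
date_string <- c(
  "Jun 1 15:56:37 1995", "Jun 1 16:29:29 1995", "Jun 1 16:31:42 1995",
  "Jun 1 16:34:11 1995", "Jun 1 16:41:02 1995", "Jun 1 16:41:13 1995",
  "Jun 1 16:41:13 1995", "Jun 1 16:45:07 1995", "Jun 1 17:26:50 1995",
  "Jun 1 17:46:53 1995", "Jun 2 17:57:09 1995", "Jun 2 17:57:50 1995",
  "Jun 2 18:10:15 1995", "Jun 2 20:14:30 1995", "Jun 2 20:15:59 1995",
  "Jun 2 21:11:54 1995"
)

# tz = "UTC" is an assumption; the log does not state its time zone
logs <- data.frame(
  date = as.POSIXct(date_string, format = "%b %e %H:%M:%S %Y", tz = "UTC")
)

# floor_date() truncates each timestamp to the start of its hour,
# and count() tallies the rows in each hour
by_hour <- logs %>%
  mutate(hour = floor_date(date, unit = "hour")) %>%
  count(hour)

# One bar per hour that has at least one log entry
p <- ggplot(by_hour, aes(x = hour, y = n)) +
  geom_col() +
  labs(x = "Hour", y = "Number of log entries")
p
```

Hours with no log entries do not appear in `by_hour`, so they show up as gaps in the plot rather than as zero-height bars.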