Question

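I have a dataset of about 28,000 entries. They consist of date/time stamps in the following format:

02/21/2014 12:11:24 PM

I would like to create some graphs from the data to visualize it better. It would be great if someone could point me in the right direction on how to make a graph showing the number of entries within a given time period. The graph is meant to show how many people applied per hour over the range of the dataset (about 3 weeks).

So, if there were 4 entries between 11:00 and 11:59 PM on February 21, I would like the graph to have a value of 4 on the y-axis.

If you feel there is a better platform for doing this, that would be appreciated too.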

Answer 1

## generate data
set.seed(1L);
N <- 28e3L;
dts <- sort(as.POSIXct('2014-02-01')+86400L*(sample(7L*3L,N,TRUE,rep(c(1L,2L,rep(10L,5L)),3L))-1L)+rnorm(N,86400/2,86400/8));

## bucket into hours and table
dts.cut <- cut(dts,'hour');
dts.freq <- table(dts.cut);

## precompute plot parameters
xlim <- range(dts); xlim <- as.POSIXct(c(round(xlim[1L]-86400/2,'day'),round(xlim[2L]+86400/2,'day'))); ## must convert back from POSIXlt to POSIXct, otherwise plot() fails on xlim
xticks.day <- seq(xlim[1L],xlim[2L],'day');
xticks.week <- xticks.day[setdiff(which(as.POSIXlt(xticks.day)$wday==6L),c(1L,length(xticks.day)))]; ## wday 6 is Saturday; unlike weekdays(), this does not depend on the locale's day names
xticks <- rep(xticks.day,each=3L)+1:3*60*60*6;
ylim <- range(dts.freq); ylim <- c(0,(ylim[2L]+9L)%/%10L*10L);
yticks <- seq(0,ylim[2L],10L);
col <- 'red';

## helper function, from <http://stackoverflow.com/questions/29125019/get-margin-line-locations-mgp-in-user-coordinates>
line2user <- function(line,side) {
    lh <- par('cin')[2L]*par('cex')*par('lheight');
    x1 <- diff(grconvertX(0:1,'inches','user'));
    y1 <- diff(grconvertY(0:1,'inches','user'));
    switch(side,
        `1`=par('usr')[3L]-line*y1*lh,
        `2`=par('usr')[1L]-line*x1*lh,
        `3`=par('usr')[4L]+line*y1*lh,
        `4`=par('usr')[2L]+line*x1*lh,
        stop('side must be 1, 2, 3, or 4',call.=FALSE)
    );
}; ## end line2user()

## draw plot
par(mar=c(5,4,4,2)+0.1+c(2,0,0,0));
plot(NA,xlim=xlim,ylim=ylim,axes=FALSE,xaxs='i',yaxs='i',ann=FALSE);
abline(v=xticks,col='lightgrey');
segments(xticks.day,ylim[2L],y1=line2user(4,1L),col='darkgrey',lwd=2,xpd=NA);
segments(xticks.week,ylim[2L],y1=line2user(4,1L),col='black',lwd=2,xpd=NA);
abline(h=yticks,col='lightgrey');
abline(h=0);
axis(1L,xticks,format(xticks,'%H:00'),las=2L,cex.axis=0.7);
axis(2L,yticks,las=2L,cex.axis=0.7);
mtext('Time',1L,5,font=3L);
mtext('Frequency',2L,2.75,font=3L);
mtext(format(xticks.day[-length(xticks.day)],'%a %b %d'),1L,2.75,at=xticks.day[-length(xticks.day)]+12*60*60,cex=0.7,font=2L);
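x <- seq(as.POSIXct(trunc(min(dts),'hours')),by='hour',length.out=length(dts.freq)); ## bin start times, computed directly instead of re-parsing the table's labels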
y <- dts.freq;
lines(x,y,col=col,xpd=NA);
points(x,y,pch=16L,cex=0.7,col=col,xpd=NA);
title(paste0('Events per hour, ',format(xlim[1L],'%Y-%m-%d'),' to ',format(xticks.day[length(xticks.day)-1L],'%Y-%m-%d')));
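
The generated data above stands in for the real timestamps. As a minimal sketch of the same hourly bucketing applied to strings in the question's format, the snippet below assumes the format string '%m/%d/%Y %I:%M:%S %p', a locale that uses AM/PM markers, and the UTC time zone; none of these is stated in the question, and the stamps vector is made-up illustration data.

```r
## made-up sample in the question's format (illustration only)
stamps <- c('02/21/2014 12:11:24 PM',
            '02/21/2014 11:05:10 PM',
            '02/21/2014 11:20:00 PM',
            '02/21/2014 11:41:37 PM',
            '02/21/2014 11:59:59 PM');

## parse 12-hour timestamps: %I is the 12-hour clock, %p the AM/PM marker (locale-dependent)
parsed <- as.POSIXct(stamps,format='%m/%d/%Y %I:%M:%S %p',tz='UTC');
stopifnot(!anyNA(parsed)); ## fail early if the assumed format does not match

## bucket into hours and count; hours without entries get a count of 0
per.hour <- table(cut(parsed,'hour'));
print(per.hour);
```

With real data, parsed would take the place of dts in the plotting code above.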

Answer 2

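R is perfect for this. There are several packages that could be useful. Below I show some sample data and one of the simplest plotting routines. You can find other charts that might interest you in the ggplot2 package.

The lubridate package makes it easier to parse dates. You will first need to import your data. Since no sample of the imported data was provided, I have added some general tips at the end.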

install.packages("lubridate")
library(lubridate)

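Generate some sample data:

Lubridate has a family of similar functions named with the letters y, m, d, h, m and s. You can arrange the letters in many different orders, and there will generally be a function in the package that parses your date. For example, if you only had dates, such as 2014/02/21, you would use the ymd() function. For the data you described, you need mdy_hms(). You won't need seq() for your imported data; it is only there to generate the example.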

start_date <- mdy_hms("02/21/2014 12:11:24 PM")
end_date <- mdy_hms("02/22/2014 12:11:24 PM")

date.sequence <- seq(start_date,end_date, by = '1 hour')

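Plot as a histogram, using the freq option:

This gives whole numbers: the count for each bin on the y-axis. Without it you get the density, meaning the plot is normalized so that the total area of the bars equals 1. The second argument is breaks, the number of bins; you can replace it with a number such as 20 or 100. Using 28,000 bins probably would not give a nice chart.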

hist(date.sequence, length(date.sequence), freq = TRUE)
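
For hourly counts specifically, the date-time method of hist() also accepts an interval name in place of a bin count. The following is a minimal sketch, assuming base R's breaks = "hours" specification and a handful of made-up timestamps (the stamps vector is illustrative, not the asker's data):

```r
# made-up timestamps (illustration only): offsets in seconds from a start time
start <- as.POSIXct("2014-02-21 12:11:24", tz = "UTC")
stamps <- start + c(0, 600, 1200, 7500, 39000)

# one bar per clock hour; freq = TRUE keeps counts on the y-axis
h <- hist(stamps, breaks = "hours", freq = TRUE)

# the returned object holds the per-bin counts
print(h$counts)
```

Each bar then corresponds to one clock hour, which matches the per-hour count asked for in the question.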

Addendum on importing data:

This was not part of the original question, but help with importing may also be useful.

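Importing data from a CSV file. The as.is argument keeps R from converting the strings with its default methods, so that you can parse them with lubridate afterwards. Because each timestamp contains spaces, set sep="," so that read.table() does not split it on whitespace.

all.dates <- read.table("filename.csv", sep=",", as.is=TRUE)

Then pick the appropriate lubridate function for your format and apply it to the column holding the timestamps (V1 here, assuming a single column without a header). For example:

all.dates.reformatted <- mdy_hms(all.dates$V1)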
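
To check the import steps end to end without the real file, here is a small sketch that writes a temporary one-column CSV and reads it back. The file contents, the path variable and the V1 column name are assumptions made for illustration; a file with a header or several columns would need different arguments.

```r
library(lubridate)

# write a tiny one-column CSV to a temporary file (illustration only)
path <- tempfile(fileext = ".csv")
writeLines(c("02/21/2014 12:11:24 PM",
             "02/21/2014 12:45:00 PM",
             "02/21/2014 11:05:10 PM"), path)

# sep = "," keeps the date, time and AM/PM parts together in one field
raw <- read.table(path, sep = ",", as.is = TRUE)

# V1 is the default name read.table() gives the first unnamed column
parsed <- mdy_hms(raw$V1)
print(parsed)
```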
