Count data frame

Time: 2015-10-05 14:56:05

Tags: r count statistics apply

I have count data, i.e. each date + time combination represents one data point. I want a new data frame that counts how many data points there are per hour on a given date. My current data frame looks like this:

  DATE        TIME
1 2014-02-15  15:02
2 2014-02-15  15:12
3 2014-04-15  02:02
4 2014-05-15  11:02
5 2014-06-15  15:42
6 2014-06-15  16:02
....

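I want this so that I can plot x = hour of the day against y = number of data points (over one year). I tried doing it with nested for loops, but that did not work.

Edit: if possible, date/hour combinations with no data points should also appear in the data frame, with COUNT = 0.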

4 answers:

Answer 0: (score: 1)

Is this what you are looking for?

options(stringsAsFactors = FALSE)

data = read.table(text  = 
"                  1 2014-02-15  15:02
                   2 2014-02-15  15:12
                   3 2014-04-15  02:02
                   4 2014-05-15  11:02
                   5 2014-06-15  15:42
                   6 2014-06-15  16:02")


colnames(data) = c("index", "date", "time")

table(data$date)

 # 2014-02-15 2014-04-15 2014-05-15 2014-06-15 
 #     2          1          1          2 

table(data$date, data$time)

fz = table(data$date, substr(data$time, 1,2))
print(fz)   

 #            02 11 15 16
 # 2014-02-15  0  0  2  0
 # 2014-04-15  1  0  0  0
 # 2014-05-15  0  1  0  0
 # 2014-06-15  0  0  1  1

If you want to reshape the data, you can do the following:

library(reshape)

otherFormat = melt(fz)
colnames(otherFormat) = c("date","hour", "frequency")

print(otherFormat)

#          date hour frequency
# 1  2014-02-15    2         0
# 2  2014-04-15    2         1
# 3  2014-05-15    2         0
# 4  2014-06-15    2         0
# 5  2014-02-15   11         0
# 6  2014-04-15   11         0
# 7  2014-05-15   11         1
# 8  2014-06-15   11         0
# 9  2014-02-15   15         2
# 10 2014-04-15   15         0
# 11 2014-05-15   15         0
# 12 2014-06-15   15         1
# 13 2014-02-15   16         0
# 14 2014-04-15   16         0
# 15 2014-05-15   16         0
# 16 2014-06-15   16         1
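If loading a package only for the reshape step is undesirable, base R can produce the same long layout. The following is a minimal sketch that assumes the six sample rows are representative; the names counts and long are illustrative only. Naming the table dimensions and passing responseName to as.data.frame() replaces the separate colnames() step, and stringsAsFactors = FALSE keeps the hours as strings such as "02".

```r
# Same sample data as above, read without the index column
data <- read.table(text = "
2014-02-15 15:02
2014-02-15 15:12
2014-04-15 02:02
2014-05-15 11:02
2014-06-15 15:42
2014-06-15 16:02", col.names = c("date", "time"), stringsAsFactors = FALSE)

# Cross-tabulate date against the two-digit hour prefix of the time
counts <- table(date = data$date, hour = substr(data$time, 1, 2))

# Long format: one row per date/hour combination, zero counts included
long <- as.data.frame(counts, responseName = "frequency",
                      stringsAsFactors = FALSE)
print(long)
```

Note that, like the table above, this only covers hours that occur at least once in the data.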

Answer 1: (score: 1)

IMO, the most readable way:

Edited to answer your updated question

library(dplyr)
library(stringr)

df <- date.data %>%
  group_by(
    DATE = as.Date(DATE), 
    HOUR = as.numeric(str_sub(TIME, 1, 2))
    ) %>%
  tally 

# create a data frame with all dates/hours
expand.grid(
  # include all dates from first to last
  DATE = seq.Date(min(df$DATE), max(df$DATE), "day"),
  HOUR = 0:23
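) %>% 
  arrange(DATE) %>%
  left_join(df, by = c("DATE", "HOUR")) %>%
  # date/hour combinations without observations come back as NA; report them as 0
  mutate(n = ifelse(is.na(n), 0L, n))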
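The grid-and-join idea can also be written with tidyr::complete(), which builds the missing combinations and fills them in one step. This is only a sketch, not part of the original answer: it assumes tidyr is installed, that date.data holds the six sample rows from the question as character columns, and the name hourly is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (character columns assumed)
date.data <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

hourly <- date.data %>%
  mutate(DATE = as.Date(DATE),
         HOUR = as.integer(substr(TIME, 1, 2))) %>%
  # one row per observed date/hour pair, with the count in column n
  count(DATE, HOUR) %>%
  # add every day between the first and last date and every hour 0..23,
  # filling the count with 0 where nothing was observed
  complete(DATE = seq(min(DATE), max(DATE), by = "day"),
           HOUR = 0:23,
           fill = list(n = 0L))

head(hourly)
```

As in the answer above, the date range runs from the first to the last observed date, so days with no data at all are included as well.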

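Answer 2: (score: 1)

Here is an additional option. First, create an hour column inside mutate(). Then count how many data points exist for each DATE and hour with count(). After ungrouping, join the two data frames to get the desired result. The expand.grid() part creates all combinations of DATE and hour (00 to 23). Since your data writes 2 as 02, I used c(paste0("0", 0:9), 10:23). Finally, replace NA with 0 in the last mutate().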

library(dplyr)
library(stringi)
library(data.table)

mutate(mydf, DATE, hour = stri_extract_first(TIME, regex = "\\d+")) %>%
count(DATE, hour) %>%
ungroup %>%
right_join(expand.grid(DATE = unique(.$DATE),
                       hour = c(paste0("0", 0:9), 10:23))) %>%
mutate(n = replace(n, is.na(n), 0))

# A bit of outcome
#         DATE hour n
#1  2014-02-15   00 0
#2  2014-04-15   00 0
#3  2014-05-15   00 0
#4  2014-06-15   00 0
#5  2014-02-15   01 0

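You can do the same with data.table. Create a column for hour and count the data points by DATE and hour, storing the result in temp. Then merge temp with a data table that holds all combinations of DATE and hour (00 to 23); you can build that table with CJ(). Once the merge is done, replace NA with 0 in the total column.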

setDT(mydf)[, hour := stri_extract_first(TIME, regex = "\\d+")][,
            list(total = .N), by = list(DATE, hour)] -> temp

merge(temp,
      CJ(DATE = unique(mydf$DATE), hour = c(paste0("0", 0:9), 10:23)),
      by = c("DATE", "hour"), all = TRUE)[, total := replace(total, is.na(total), 0)][]

#          DATE hour total
# 1: 2014-02-15   02     0
# 2: 2014-02-15   11     0
# 3: 2014-02-15   15     2
# 4: 2014-02-15   16     0
# 5: 2014-02-15   00     0
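The merge step can alternatively be expressed as a data.table join with on =, which keeps every row of the CJ() grid. The sketch below is self-contained and makes a few assumptions: TIME is a character column rather than a factor, the hour is taken with substr() instead of stringi, and the names dt, counts, grid and result are illustrative.

```r
library(data.table)

# Sample data from the question, with TIME as character
dt <- data.table(
  DATE = as.Date(c("2014-02-15", "2014-02-15", "2014-04-15",
                   "2014-05-15", "2014-06-15", "2014-06-15")),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02")
)

# Two-digit hour prefix, then the count per DATE/hour pair
dt[, hour := substr(TIME, 1, 2)]
counts <- dt[, .(total = .N), by = .(DATE, hour)]

# All combinations of the observed dates and the hours "00".."23"
grid <- CJ(DATE = unique(dt$DATE), hour = sprintf("%02d", 0:23))

# Join: every grid row is kept; combinations without data get NA in total
result <- counts[grid, on = .(DATE, hour)]
result[is.na(total), total := 0L]

head(result)
```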

Data

mydf <- structure(list(DATE = structure(c(16116, 16116, 16175, 16205, 
16236, 16236), class = "Date"), TIME = structure(c(3L, 4L, 1L, 
2L, 5L, 6L), .Label = c("02:02", "11:02", "15:02", "15:12", "15:42", 
"16:02"), class = "factor")), class = "data.frame", .Names = c("DATE", 
"TIME"), row.names = c(NA, -6L))

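Answer 3: (score: 0)

You can do this in several ways, but I suspect the simplest is to use table. With table you get back the frequency of each date, which is essentially just a count of the dates in the data frame.

After extracting the hour you can do the same thing; you can even cross-tabulate the two with table(DF$DATE, DF$HOUR). Wrapping the result in as.data.frame gives a listing similar to what you are looking for.

Edited to add: in response to your edit of the question, you can use factor levels to get zero counts out of table. table respects your factor levels by including them in the output even when they do not occur in the input (in fact, I believe table coerces its input to factors behind the scenes).

Example code: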

# Set options and load example data
options(stringsAsFactors = FALSE)
date.data <- data.frame(DATE = c("2014-02-15","2014-02-15","2014-04-15","2014-05-15","2014-06-15","2014-06-15"),
                        TIME = c("15:02","15:12","02:02","11:02","15:42","16:02"))

# Extract the hour
date.data$HOUR <- sapply(X = strsplit(x = date.data$TIME,split = ":"),FUN = `[[`,1)

# Now, set the hours as a factor level - this will allow table() to fill the data in as you are requesting
date.data$HOUR <- factor(x = date.data$HOUR,
                         levels = sprintf("%02d", 0:23))  # "00", "01", ..., "23"

# Obtain the first table of interest
as.data.frame(table(date.data$DATE))

        Var1 Freq
1 2014-02-15    2
2 2014-04-15    1
3 2014-05-15    1
4 2014-06-15    2

# And the second table
as.data.frame(table(date.data$DATE,date.data$HOUR))

         Var1 Var2 Freq
1  2014-02-15   00    0
2  2014-04-15   00    0
3  2014-05-15   00    0
4  2014-06-15   00    0
5  2014-02-15   01    0
6  2014-04-15   01    0
7  2014-05-15   01    0
8  2014-06-15   01    0
....
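For the hour-of-day plot mentioned in the question, the same factor trick can be applied to the hour alone, so that the counts are summed over all dates. This is a rough sketch using base graphics; the name per.hour and the axis labels are assumptions, and substr() is used here as a shorter way to take the hour.

```r
# Sample data from the question
date.data <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

# Hour as a factor with all 24 levels, so empty hours are counted as 0
date.data$HOUR <- factor(substr(date.data$TIME, 1, 2),
                         levels = sprintf("%02d", 0:23))

# Number of data points per hour of the day, summed over every date
per.hour <- table(date.data$HOUR)

# x = hour of the day, y = number of data points
barplot(per.hour, xlab = "Hour of day", ylab = "Number of data points")
```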