I have count data: each date+time combination represents one data point. My data frame currently looks like this:
  DATE       TIME
1 2014-02-15 15:02
2 2014-02-15 15:12
3 2014-04-15 02:02
4 2014-05-15 11:02
5 2014-06-15 15:42
6 2014-06-15 16:02
....
Now I want a new data frame that counts how many data points there are per hour on a given date.
I want this so that I can make a plot with x = hour of the day and y = number of data points (over one year). I tried doing it with nested for loops, but that did not work.
Edit: if possible, date/hour combinations without any data points should still appear in the data frame, with COUNT = 0.
Answer 0 (score: 1)
Is this what you are looking for?
options(stringsAsFactors = F)
data = read.table(text =
" 1 2014-02-15 15:02
2 2014-02-15 15:12
3 2014-04-15 02:02
4 2014-05-15 11:02
5 2014-06-15 15:42
6 2014-06-15 16:02")
colnames(data) = c("index", "date", "time")
table(data$date)
# 2014-02-15 2014-04-15 2014-05-15 2014-06-15
#          2          1          1          2
table(data$date, data$time)
fz = table(data$date, substr(data$time, 1,2))
print(fz)
#            02 11 15 16
# 2014-02-15  0  0  2  0
# 2014-04-15  1  0  0  0
# 2014-05-15  0  1  0  0
# 2014-06-15  0  0  1  1
If you want to reshape the data, you can do the following:
library(reshape)
otherFormat = melt(fz)
colnames(otherFormat) = c("date","hour", "frequency")
print(otherFormat)
#          date hour frequency
# 1  2014-02-15    2         0
# 2  2014-04-15    2         1
# 3  2014-05-15    2         0
# 4  2014-06-15    2         0
# 5  2014-02-15   11         0
# 6  2014-04-15   11         0
# 7  2014-05-15   11         1
# 8  2014-06-15   11         0
# 9  2014-02-15   15         2
# 10 2014-04-15   15         0
# 11 2014-05-15   15         0
# 12 2014-06-15   15         1
# 13 2014-02-15   16         0
# 14 2014-04-15   16         0
# 15 2014-05-15   16         0
# 16 2014-06-15   16         1
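For the plot described in the question (x = hour of the day, y = number of data points), a minimal base R sketch could look like the following. It assumes the six sample times from the question; the variable names `time`, `hour`, and `hourly` are made up for illustration. Declaring all 24 hours as factor levels makes `table()` report empty hours as 0.

```r
# Sample times copied from the question
time <- c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02")

# Keep all 24 hours as factor levels so that empty hours are counted as 0
hour <- factor(substr(time, 1, 2), levels = sprintf("%02d", 0:23))

# Number of data points per hour, summed over all dates
hourly <- table(hour)

# x = hour of the day, y = number of data points
barplot(hourly, xlab = "Hour of day", ylab = "Number of data points")
```

With a full year of data the same three steps apply; only the input vector changes.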
Answer 1 (score: 1)
IMO, the most readable way:
Edited to answer your updated question
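library(dplyr)
library(stringr)
# date.data: a data frame with DATE and TIME columns, as in the question
df <- date.data %>%
  group_by(
    DATE = as.Date(DATE),
    HOUR = as.numeric(str_sub(TIME, 1, 2))
  ) %>%
  tally() %>%
  ungroup()
# create a data frame with all dates/hours
expand.grid(
  # include all dates from first to last
  DATE = seq.Date(min(df$DATE), max(df$DATE), "day"),
  HOUR = 0:23
) %>%
  arrange(DATE) %>%
  left_join(df, by = c("DATE", "HOUR")) %>%
  # date/hour combinations without data come back as NA; report them as 0
  mutate(n = replace(n, is.na(n), 0))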
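As a possible variation on the same idea, `tidyr::complete()` can fill in the missing date/hour combinations in one step. This is not what the answer's author used; it is a sketch that assumes the `tidyr` package is installed and rebuilds the sample data from the question under the name `date.data`. The name `counts` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question
date.data <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

counts <- date.data %>%
  mutate(DATE = as.Date(DATE),
         HOUR = as.integer(substr(TIME, 1, 2))) %>%
  # one row per observed DATE/HOUR pair, with the count in column n
  count(DATE, HOUR) %>%
  # add every day from first to last and every hour 0..23; missing pairs get n = 0
  complete(DATE = seq(min(DATE), max(DATE), by = "day"),
           HOUR = 0:23,
           fill = list(n = 0L))
```

The result has one row per day and hour between the first and last date, so it can be passed directly to a plotting function.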
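Answer 2 (score: 1)
Another option is the following. First, you create an hour column in mutate(). Then you count how many data points exist per DATE and hour with count(). After ungrouping the data, you can join two data frames to create the desired result. The expand.grid() part creates all combinations of DATE and hour (00 to 23). Since you have 02 for 2, I used c(paste0("0", 0:9), 10:23). Finally, NA is replaced with 0 in the final mutate().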
library(dplyr)
library(stringi)
library(data.table)
mutate(mydf, DATE, hour = stri_extract_first(TIME, regex = "\\d+")) %>%
  count(DATE, hour) %>%
  ungroup() %>%
  # keep hour as character so it matches the extracted hour column in the join
  right_join(expand.grid(DATE = unique(.$DATE),
                         hour = c(paste0("0", 0:9), 10:23),
                         stringsAsFactors = FALSE),
             by = c("DATE", "hour")) %>%
  mutate(n = replace(n, is.na(n), 0))
# A bit of outcome
#        DATE hour n
#1 2014-02-15   00 0
#2 2014-04-15   00 0
#3 2014-05-15   00 0
#4 2014-06-15   00 0
#5 2014-02-15   01 0
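With data.table, you can do the same thing. You create a column for hour and count the number of data points by DATE and hour. Then you merge temp with a data table that has all combinations of DATE and hour (00 to 23). You can create that data table with CJ(). Once the merge is done, replace NA with 0 in the total column.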
setDT(mydf)[, hour := stri_extract_first(TIME, regex = "\\d+")][,
    list(total = .N), by = list(DATE, hour)] -> temp
merge(temp,
      CJ(DATE = unique(mydf$DATE), hour = c(paste0("0", 0:9), 10:23)),
      by = c("DATE", "hour"), all = TRUE)[, total := replace(total, is.na(total), 0)][]
#          DATE hour total
# 1: 2014-02-15   02     0
# 2: 2014-02-15   11     0
# 3: 2014-02-15   15     2
# 4: 2014-02-15   16     0
# 5: 2014-02-15   00     0
Data
mydf <- structure(list(DATE = structure(c(16116, 16116, 16175, 16205,
16236, 16236), class = "Date"), TIME = structure(c(3L, 4L, 1L,
2L, 5L, 6L), .Label = c("02:02", "11:02", "15:02", "15:12", "15:42",
"16:02"), class = "factor")), class = "data.frame", .Names = c("DATE",
"TIME"), row.names = c(NA, -6L))
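Answer 3 (score: 0)
You can do this in a few ways, but I suspect the easiest is to use table. With table, you can return the frequency of each date. This is basically just a count of the dates in your data frame.
After extracting the hour you can do the same thing; you can even nest it with table(DF$DATE,DF$HOUR). Using as.data.frame gets you a listing similar to what you are looking for.
Edited to add: in response to your edit of the question, you can use factor levels to get the zero counts in your table statement. table respects your factor levels by including them in the output even if they are not found in the input (in fact, I believe table coerces its input to factors on the back end).
Example code: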
# Set options and load example data
options(stringsAsFactors = FALSE)
date.data <- data.frame(DATE = c("2014-02-15","2014-02-15","2014-04-15","2014-05-15","2014-06-15","2014-06-15"),
                        TIME = c("15:02","15:12","02:02","11:02","15:42","16:02"))
# Extract the hour
date.data$HOUR <- sapply(X = strsplit(x = date.data$TIME,split = ":"),FUN = `[[`,1)
# Now, set the hours as a factor level - this will allow table() to fill the data in as you are requesting
date.data$HOUR <- factor(x = date.data$HOUR,
                         levels = c("00","01","02","03","04","05",
                                    "06","07","08","09","10","11",
                                    "12","13","14","15","16","17",
                                    "18","19","20","21","22","23"),
                         labels = c("00","01","02","03","04","05",
                                    "06","07","08","09","10","11",
                                    "12","13","14","15","16","17",
                                    "18","19","20","21","22","23"))
# Obtain the first table of interest
as.data.frame(table(date.data$DATE))
#         Var1 Freq
# 1 2014-02-15    2
# 2 2014-04-15    1
# 3 2014-05-15    1
# 4 2014-06-15    2
# And the second table
as.data.frame(table(date.data$DATE,date.data$HOUR))
#         Var1 Var2 Freq
# 1 2014-02-15   00    0
# 2 2014-04-15   00    0
# 3 2014-05-15   00    0
# 4 2014-06-15   00    0
# 5 2014-02-15   01    0
# 6 2014-04-15   01    0
# 7 2014-05-15   01    0
# 8 2014-06-15   01    0
# ....
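The factor-level behaviour this answer relies on can be seen in isolation with a smaller sketch. The vector `hours_seen` is invented for illustration, and `sprintf("%02d", 0:23)` is used here only as a shorter way to write the 24 zero-padded hour labels.

```r
# Hours that actually occur in some hypothetical data
hours_seen <- c("15", "15", "02")

# All 24 zero-padded hour labels: "00", "01", ..., "23"
all_hours <- sprintf("%02d", 0:23)

# Declaring every hour as a level makes table() report unseen hours too
hour_factor <- factor(hours_seen, levels = all_hours)
counts <- as.data.frame(table(HOUR = hour_factor))
```

Here `counts` has 24 rows, one per hour, with `Freq` equal to 0 for every hour that does not occur in `hours_seen`.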