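I have a question about grouping data into specific categories.
Normally, when I have a factor variable, I do something like the following to bin/recode the data into the scheme I want: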
educ = NA
educ[educ2 %in% levels(educ2)[c(5,8)]] <- "HS or Some College"
educ[educ2 %in% levels(educ2)[2:3]] <- "College Degree"
educ[educ2 %in% levels(educ2)[c(4,6)]] <- "Advanced Degree"
educ[educ2 %in% levels(educ2)[c(1,7,9)]] <- NA
educ = factor(educ)
However, I am struggling to regroup a factor variable, TIME, which has 10,000+ levels. The data are structured as follows:
> levels(wj$time)
[1] "0:00:05" "0:00:07" "0:00:08" "0:00:10" "0:00:13" "0:00:15" "0:00:18" "0:00:23" "0:00:31" "0:00:34" "0:00:36"
[12] "0:00:39" "0:00:41" "0:00:47" "0:00:48" "0:00:54" "0:00:55" "0:00:56" "0:00:59" "0:01:01" "0:01:02" "0:01:03"
[23] "0:01:13" "0:01:17" "0:01:31" "0:01:33" "0:01:41" "0:01:44" "0:01:48" "0:01:50" "0:01:52" "0:01:53" "0:01:55"
[34] "0:02:08" "0:02:12" "0:02:13" "0:02:21" "0:02:26" "0:02:27" "0:02:30" "0:02:32" "0:02:33" "0:02:36" "0:02:37"
[45] "0:02:38" "0:02:43" "0:02:45" "0:02:53" "0:02:56" "0:03:07" "0:03:15" "0:03:19" "0:03:21" "0:03:22" "0:03:24"
[56] "0:03:30" "0:03:36" "0:03:39" "0:03:41" "0:03:49" "0:03:56" "0:03:59" "0:04:02" "0:04:04" "0:04:07" "0:04:10"
[67] "0:04:11" "0:04:12" "0:04:14" "0:04:16" "0:04:17" "0:04:19" "0:04:22" "0:04:27" "0:04:28" "0:04:30" "0:04:37"
[78] "0:04:39" "0:04:41" "0:04:49" "0:04:51" "0:04:52" "0:04:53" "0:04:54" "0:05:05" "0:05:06" "0:05:20" "0:05:22"
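With this many factor levels, I'm just not sure how to quickly bin the data into specific brackets. I would like to group them as 0:00:00 to 0:05:00, 0:05:01 to 0:10:00, and so on. With so many factor levels, I'm a bit lost on how to work out where each group should start and end. Can anyone offer any help? With 10,000+ buckets, this becomes a problem for my usual way of doing things.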
Thanks!
Answer 0 (score: 4)
You can split the timestamp into its components: the buckets are then easy to compute.
# Sample data
n <- 10
d <- data.frame(
  time = paste(
    sample(0:23, n, replace=TRUE),
    sample(0:59, n, replace=TRUE),
    sample(0:59, n, replace=TRUE),
    sep=":"
  ),
  value = rnorm(n)
)
# Split the "time" column into its components
d$time <- as.character( d$time )
times <- strsplit( d$time, ":" )
times <- lapply( times, as.numeric )
times <- do.call(rbind, times)
colnames(times) <- c("hour", "minute", "second")
d <- cbind(times, d)
# Build the buckets (5-minute windows within each hour)
d$bucket <- paste(
  sprintf( "%02d:%02d:00", d$hour, floor( d$minute / 5 ) * 5 ),
  sprintf( "%02d:%02d:59", d$hour, floor( d$minute / 5 ) * 5 + 4 ),
  sep=" to "
)
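Since the question is about a factor with 10,000+ levels, the same bucket computation can also be applied to the levels rather than to every row. The following is only a sketch under assumptions: `time_f` is a made-up stand-in for `wj$time`, and `bucket_label` is a hypothetical helper name, not something from the answer above. It relies on the documented behaviour that assigning duplicated values to `levels()` merges the corresponding levels.

```r
# Hypothetical factor standing in for wj$time
time_f <- factor(c("0:00:05", "0:04:59", "0:05:00", "0:12:30", "1:07:02", "0:00:05"))

# Hypothetical helper: map "H:MM:SS" strings to 5-minute bucket labels,
# using the same formula as the answer above
bucket_label <- function(x) {
  parts <- do.call(rbind, lapply(strsplit(x, ":", fixed = TRUE), as.numeric))
  start <- floor(parts[, 2] / 5) * 5
  paste(
    sprintf("%02d:%02d:00", parts[, 1], start),
    sprintf("%02d:%02d:59", parts[, 1], start + 4),
    sep = " to "
  )
}

# Relabel the levels; levels that map to the same bucket are merged
bucket_f <- time_f
levels(bucket_f) <- bucket_label(levels(time_f))

table(bucket_f)
```

This way the string handling runs once per distinct level instead of once per row, which may matter when the data frame is much longer than the level set.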
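Answer 1 (score: 1)
The problem you are running into is that you have what is effectively a continuous variable, represented in a particular character format and stored as a factor. A factor is not appropriate here, because its levels only reflect the values that happen to appear in the data, not a predefined set of possible values. It is a character vector at all only because it follows a specific convention for formatting a data type, namely time. I am guessing it is hours:minutes:seconds, but given your example it could be days(?):hours:minutes. If it is hours:minutes:seconds, it would be better to represent these times as times objects from the chron package. Once you do that, the problem becomes how to categorize a continuous variable into discrete groups. That is done with the cut function.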
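To illustrate the idea of cutting a continuous variable, here is a minimal sketch that uses only base R rather than chron (a substitution made so the example has no package dependency). It assumes the format is hours:minutes:seconds; `time_chr`, `secs`, and `bin` are made-up names for illustration. The bins are right-closed, which roughly matches the "0:05:01 to 0:10:00" style of grouping in the question.

```r
# Hypothetical sample in the question's "H:MM:SS" format
time_chr <- c("0:00:05", "0:05:00", "0:05:01", "0:10:00", "0:12:30")

# Convert each time to seconds since 0:00:00
parts <- do.call(rbind, lapply(strsplit(time_chr, ":", fixed = TRUE), as.numeric))
secs <- as.vector(parts %*% c(3600, 60, 1))

# Break points every 300 seconds (5 minutes) across one day
breaks <- seq(0, 24 * 3600, by = 300)

# Right-closed bins; include.lowest = TRUE keeps an exact 0:00:00 in the first bin
bin <- cut(secs, breaks = breaks, include.lowest = TRUE, dig.lab = 6)

# Drop the bins that contain no observations
bin <- droplevels(bin)

table(bin)
```

The levels here are labelled in seconds; if clock-style labels are wanted, they can be supplied through the `labels` argument of `cut`.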
Answer 2 (score: 0)
Combining the answers/code from @Brian Diggs and @Vincent Zoonekynd, I would recommend a few functions:
?strptime
?POSIXlt
?cut.POSIXt
# create a time vector stored as a factor within a data frame
n <- 10
d <- data.frame(
  time = as.factor(paste(
    sample(0:23, n, replace=TRUE),
    sample(0:59, n, replace=TRUE),
    sample(0:59, n, replace=TRUE),
    sep=":"
  )),
  value = rnorm(n)
)
# convert to a date-time, then cut into hourly bins
# (cut() dispatches to cut.POSIXt; the outer parentheses print the result)
(d$time <- cut(strptime(as.character(d$time), format="%H:%M:%S"), breaks="hour"))
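If you don't want hourly breaks, you can use "day" or another interval. You can also take a look at this link for an answer to your question, which I found by searching for "convert string to time".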
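For the 5-minute grouping asked about in the question, the interval string accepted by cut.POSIXt can be preceded by an integer, e.g. "5 min". The sketch below uses made-up sample values (`time_chr`) and parses them in UTC, which is an assumption made here to sidestep daylight-saving effects. Note that strptime fills in the current date, and that, as far as I understand, the first bin starts at the earliest observed time truncated to the minute rather than at a fixed clock boundary.

```r
# Hypothetical sample times in "H:MM:SS" format
time_chr <- c("0:00:05", "0:04:59", "0:05:00", "0:12:30")

# Parse as date-times; the date part defaults to today
tm <- as.POSIXct(strptime(time_chr, format = "%H:%M:%S", tz = "UTC"))

# Left-closed 5-minute bins, starting at the minute of the earliest value
bucket <- cut(tm, breaks = "5 min")

table(bucket)
```

If the bins need to start at fixed boundaries regardless of the data, a vector of explicit break times can be passed as `breaks` instead of an interval string.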
HTH。