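I'm still having some trouble understanding data.table notation. Can anyone explain why the following doesn't work?
I'm trying to group dates using cut. The breaks to use are stored in another data.table and depend on the by argument of the outer "data" data.table.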
library(data.table)
data <- data.table(A = c(1, 1, 1, 2, 2, 2),
                   DATE = as.POSIXct(c("01-01-2012", "30-05-2015", "01-01-2020", "30-06-2012", "30-06-2013", "01-01-1999"), format = "%d-%m-%Y"))
breaks <- data.table(B = c(1, 1, 2, 2),
                     BREAKPOINT = as.POSIXct(c("01-01-2015", "01-01-2016", "30-06-2012", "30-06-2013"), format = "%d-%m-%Y"))
# this is the call that does not give the expected buckets
data[, bucket := cut(DATE, breaks[B == A, BREAKPOINT], ordered_result = TRUE), by = A]
I can get the desired result by handling each group separately:
# expected
data[A == 1, bucket := cut(DATE, breaks[B == 1, BREAKPOINT], ordered_result = TRUE)]
data[A == 2, bucket := cut(DATE, breaks[B == 2, BREAKPOINT], ordered_result = TRUE)]
data
#    A       DATE     bucket
# 1: 1 2012-01-01         NA
# 2: 1 2015-05-30 2015-01-01
# 3: 1 2020-01-01         NA
# 4: 2 2012-06-30 2012-06-30
# 5: 2 2013-06-30         NA
# 6: 2 1999-01-01         NA
Thanks, Michael
Answer 0 (score: 5)
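The problem is that cut produces factors, and these are not handled correctly in data.table by operations (this is a bug and should be reported; the factor levels should be handled the same way as they are in rbind.data.table or rbindlist). A simple fix to your original expression is to convert to character: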
data[, bucket := as.character(cut(DATE, breaks[B == A, BREAKPOINT], ordered_result = TRUE)),
     by = A]
#   A       DATE     bucket
#1: 1 2012-01-01         NA
#2: 1 2015-05-30 2015-01-01
#3: 1 2020-01-01         NA
#4: 2 2012-06-30 2012-06-30
#5: 2 2013-06-30         NA
#6: 2 1999-01-01         NA
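To see why the factor levels matter, here is a minimal base R sketch that runs cut on each group by hand. It does not show what data.table does internally. It only illustrates that each group produces a factor with its own level label while sharing the same integer code, so codes alone cannot safely be combined across groups. The variable names and the tz = "UTC" argument are assumptions added for the illustration and are not part of the question.

```r
# Base R only; the dates mirror the two groups from the question.
# tz = "UTC" is an assumption so the result does not depend on the local time zone.
dates1  <- as.POSIXct(c("2012-01-01", "2015-05-30", "2020-01-01"), tz = "UTC")
breaks1 <- as.POSIXct(c("2015-01-01", "2016-01-01"), tz = "UTC")
dates2  <- as.POSIXct(c("2012-06-30", "2013-06-30", "1999-01-01"), tz = "UTC")
breaks2 <- as.POSIXct(c("2012-06-30", "2013-06-30"), tz = "UTC")

f1 <- cut(dates1, breaks1, ordered_result = TRUE)
f2 <- cut(dates2, breaks2, ordered_result = TRUE)

# Each factor has a single level, but the labels differ between the groups
levels(f1)
levels(f2)

# The matching row in each group is stored as the same integer code
as.integer(f1)
as.integer(f2)

# Converting to character before combining keeps each group's own label
combined <- c(as.character(f1), as.character(f2))
combined
```

Because the labels live in the levels attribute rather than in the codes, converting to character inside j, as in the fix above, sidesteps the question of how levels from different groups should be merged.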