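I have several hundred repeated primary keys, each with dates associated with it. The dates may or may not have missing entries, and any missing entry needs to be replaced with the max(date) of its key.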
# Create proxy data frame
library(tibble)
df <- tibble(
key = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f", "h", "h", "i","i", "j", "j", "k", "k", "l", "l", "m", "m"),
date1 = c("NA", "2017-02-13", "NA", "2017-04-14", "2017-05-18", "2017-05-18", "NA", "2018-01-07",
"2017-09-24", "2017-09-25", "NA", "2017-09-29", "NA", "2017-08-13", "NA", "2017-04-29",
"NA", "2018-01-28", "NA", "2017-10-08", "NA", "2017-01-10", "NA", "2017-11-01")
)
df$date1 <- as.Date(df$date1, format = "%Y-%m-%d")
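Please note:
- key "a" has one missing date, which needs to be replaced with the only available date
- key "c" has no missing dates
- key "e" has two different dates, but the last date needs to be recorded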
df
# A tibble: 24 x 2
key date1
<chr> <date>
1 a NA
2 a 2017-02-13
3 b NA
4 b 2017-04-14
5 c 2017-05-18
6 c 2017-05-18
7 d NA
8 d 2018-01-07
9 e 2017-09-24
10 e 2017-09-25
# ... with 14 more rows
Solutions I have tried that do not work:
library(lubridate)
df$date <- with(df$date, as.Date(ifelse(is.na(df$date), orderDate, df$date), origin = "1970-01-01"))
library(dplyr)
df %>% group_by(key) %>%
mutate(date = (date, NA, df$date)) %>%
as.data.frame
Any help would be greatly appreciated! Thanks!
Answer 0 (score: 1)
Assuming you only want to replace date1 with the max() value of each group when it is NA, this will work. Note that you need to specify na.rm = TRUE, because max(NA, 1) returns NA rather than 1.
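The code this answer refers to is not shown here. As a minimal sketch, assuming a grouped dplyr replacement like the one timed in the benchmark of the next answer, it could look as follows. It runs on a shortened copy of the question's data; the names df_small and filled are illustrative only.

```r
library(dplyr)

# Shortened copy of the question's data (keys "a", "c" and "e" only)
df_small <- tibble(
  key   = c("a", "a", "c", "c", "e", "e"),
  date1 = as.Date(c(NA, "2017-02-13", "2017-05-18", "2017-05-18",
                    "2017-09-24", "2017-09-25"))
)

# Within each key, replace NA by the group maximum and keep other dates as-is
filled <- df_small %>%
  group_by(key) %>%
  mutate(date1 = case_when(
    is.na(date1) ~ max(date1, na.rm = TRUE),
    TRUE ~ date1
  )) %>%
  ungroup()

# Why na.rm = TRUE matters: a single NA otherwise makes the maximum NA
max(NA, 1)                # NA
max(NA, 1, na.rm = TRUE)  # 1
```

Key "e" keeps both of its dates because neither is NA. A group whose dates are all NA is not covered by this sketch: max(..., na.rm = TRUE) warns and returns -Inf in that case.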
Answer 1 (score: 0)
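There is an alternative approach that is considerably faster than Mako212's dplyr solution. It uses an update on join to replace the NA values with the max(date1) of each key group: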
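library(data.table)
DT <- as.data.table(df)
# one row per key: an NA date to join on, plus the group maximum (auto-named V2)
tmp <- DT[, .(date1 = as.Date(NA), max(date1, na.rm = TRUE)), by = key]
# update join: only rows of DT whose date1 is NA match tmp and get V2 assigned
DT[tmp, on = .(key, date1), date1 := V2][]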
    key      date1
 1:   a 2017-02-13
 2:   a 2017-02-13
 3:   b 2017-04-14
 4:   b 2017-04-14
 5:   c 2017-05-18
 6:   c 2017-05-18
 7:   d 2018-01-07
 8:   d 2018-01-07
 9:   e 2017-09-24
10:   e 2017-09-25
11:   f 2017-09-29
12:   f 2017-09-29
13:   h 2017-08-13
14:   h 2017-08-13
15:   i 2017-04-29
16:   i 2017-04-29
17:   j 2018-01-28
18:   j 2018-01-28
19:   k 2017-10-08
20:   k 2017-10-08
21:   l 2017-01-10
22:   l 2017-01-10
23:   m 2017-11-01
24:   m 2017-11-01
    key      date1
Note that only the rows where date1 is NA are replaced, and that this happens in place, i.e., without copying the whole data object.
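One way to see the in-place behaviour is to compare the memory address of the data.table before and after the update join. The following is a minimal sketch on a shortened copy of the data, using data.table's address() function; the names DT_small, tmp_small, addr_before and addr_after are illustrative. The table is built via data.frame() because key is also the name of an argument of data.table().

```r
library(data.table)

# Shortened copy of the data (keys "a" and "e" only)
DT_small <- as.data.table(data.frame(
  key   = c("a", "a", "e", "e"),
  date1 = as.Date(c(NA, "2017-02-13", "2017-09-24", "2017-09-25")),
  stringsAsFactors = FALSE
))

addr_before <- address(DT_small)

# Same update join as above, applied to the small table
tmp_small <- DT_small[, .(date1 = as.Date(NA), max(date1, na.rm = TRUE)), by = key]
DT_small[tmp_small, on = .(key, date1), date1 := V2]

addr_after <- address(DT_small)

# TRUE if DT_small is still the same object, i.e. it was not copied
identical(addr_before, addr_after)
```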
tmp contains the replacement value for each key group:
    key date1         V2
 1:   a  <NA> 2017-02-13
 2:   b  <NA> 2017-04-14
 3:   c  <NA> 2017-05-18
 4:   d  <NA> 2018-01-07
 5:   e  <NA> 2017-09-25
 6:   f  <NA> 2017-09-29
 7:   h  <NA> 2017-08-13
 8:   i  <NA> 2017-04-29
 9:   j  <NA> 2018-01-28
10:   k  <NA> 2017-10-08
11:   l  <NA> 2017-01-10
12:   m  <NA> 2017-11-01
Creating the benchmark data:
library(dplyr)
library(data.table)
n_row <- 1e5L
n_key <- 500L
share_na <- 0.5
set.seed(123L)
DT0 <- data.table(
key1 = sprintf("%04i", sample.int(n_key, n_row, TRUE)),
date1 = as.Date("2017-01-01") + sample.int(n_key, n_row, TRUE)
)
# set NA values
DT0[sample.int(n_row, share_na * n_row), date1 := NA]
# coerce to tibble
df0 <- as_tibble(DT0)
Running the benchmark:
library(microbenchmark)
bm <- microbenchmark(
dplyr = {
copy(df0) %>% group_by(key1) %>%
mutate(date1 = case_when(
is.na(date1) ~ max(date1, na.rm = TRUE),
TRUE ~ date1)
)
},
dt = {
DT <- copy(DT0)
tmp <- DT[, .(date1 = as.Date(NA), max(date1, na.rm = TRUE)), by = key1]
DT[tmp, on = .(key1, date1), date1 := V2][]
},
times = 21L
)
print(bm)
Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval cld
 dplyr 131.02040 136.81967 142.63845 137.78741 141.36084 191.37755    21   b
    dt  18.14997  18.68349  19.65384  19.32424  19.54815  26.87965    21  a
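For the given problem size of 100 k rows, 500 groups, and 50% NA values, the data.table approach is about 7 times faster than the dplyr version (median of roughly 19 ms versus 138 ms).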
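Note that a fresh copy of DT0 and df0 is used for each repetition because DT is updated in place. The call to copy() is included in the timings of both cases. The dplyr version has been modified to update date1 rather than creating a third column in the output.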
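A timing comparison is only meaningful if both expressions return the same result. As a minimal sketch of such a check, assuming a small hand-made data set (the names small, res_dplyr, res_dt and same are illustrative), the two benchmarked approaches can be compared directly:

```r
library(dplyr)
library(data.table)

# Small illustrative data set with the benchmark's column names
small <- data.frame(
  key1  = c("0001", "0001", "0002", "0002", "0002"),
  date1 = as.Date(c(NA, "2017-03-01", "2017-05-10", NA, "2017-06-20")),
  stringsAsFactors = FALSE
)

# dplyr: grouped replacement of NA by the group maximum
res_dplyr <- as_tibble(small) %>%
  group_by(key1) %>%
  mutate(date1 = case_when(
    is.na(date1) ~ max(date1, na.rm = TRUE),
    TRUE ~ date1
  )) %>%
  ungroup()

# data.table: update join on a separate object, so `small` is not modified
res_dt <- as.data.table(small)
tmp <- res_dt[, .(date1 = as.Date(NA), max(date1, na.rm = TRUE)), by = key1]
res_dt[tmp, on = .(key1, date1), date1 := V2]

# Both approaches keep the original row order, so the columns compare row by row
same <- all(res_dplyr$date1 == res_dt$date1)
same
```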