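My data contains a customer id, an order date, and an indicator that shows whether the order includes a certain product. I want to give each customer a flag that says whether their first order contained that product. Because my data set is large, I cannot use group_by and case_when, since that is too slow. I think I could speed things up with data.table.
Could you point me to a solution? I have not worked with data.table so far...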
Answer 0 (score: 2)
Another way:
library(data.table)
DT = data.table(df[, 1:3])
lookupDT = DT[, .(date = min(date)), by=id]
lookupDT[, fx := DT[copy(.SD), on=.(id, date), max(indicator), by=.EACHI]$V1]
DT[, v := "Customer without x in first order"]
DT[lookupDT[fx == 1L], on=.(id), v := "Customer with X in first order"]
# check results
fsetequal(DT[, .(id, v)], data.table(id = df$id, v = df$Customer_type))
# [1] TRUE
If you want to speed this up further, see ?IDate.
The copy on .SD is needed because of an open issue.
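As a minimal sketch of the ?IDate hint, assuming the date column starts out as a regular Date: converting it with as.IDate stores the dates as integers, which tends to help grouping and joining. The toy table and the name first_dates are hypothetical and not part of the original data.

```r
library(data.table)

# Hypothetical toy data, not the data set from the question
toy <- data.table(
  id        = c("a", "a", "b"),
  date      = as.Date(c("2018-01-02", "2018-01-01", "2018-05-05")),
  indicator = c(0L, 1L, 0L)
)

# Replace the Date column with an integer-backed IDate column
toy[, date := as.IDate(date)]

# Earliest order date per customer, as in the answer above
first_dates <- toy[, .(date = min(date)), by = id]
```

The rest of the answer's code can then run unchanged on the converted column.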
Answer 1 (score: 0)
Here is how you can improve your existing code and use dplyr more efficiently:
library(dplyr)
# Lookup table that maps the logical flag to the customer label
lookup = data.frame(First_Order_contains_x = c(TRUE, FALSE),
                    Customer_Type = c("Customer with X in first order",
                                      "Customer without x in first order"))
df %>%
  group_by(id) %>%
  # TRUE if any row on the customer's earliest date has indicator == 1
  mutate(First_Order_contains_x = any(date == min(date) & indicator == 1)) %>%
  ungroup() %>%
  left_join(lookup, by = "First_Order_contains_x")
# A tibble: 3,000 x 5
id date indicator First_Order_contains_x Customer_Type
<fct> <date> <dbl> <lgl> <fct>
1 5056 2018-03-10 1 TRUE Customer with X in first order
2 5291 2018-12-28 0 FALSE Customer without x in first order
3 5173 2018-04-19 0 FALSE Customer without x in first order
4 5159 2018-11-13 0 TRUE Customer with X in first order
5 5252 2018-05-30 0 TRUE Customer with X in first order
6 5200 2018-01-20 0 FALSE Customer without x in first order
7 4578 2018-12-18 1 FALSE Customer without x in first order
8 5308 2018-03-24 1 FALSE Customer without x in first order
9 5234 2018-05-29 1 TRUE Customer with X in first order
10 5760 2018-06-12 1 TRUE Customer with X in first order
# … with 2,990 more rows
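Answer 2 (score: 0)
Another data.table approach. First sort the data so that the first row for each customer is the earliest date; we can then test the condition on the first indicator. Next, convert the logical to an integer index (FALSE -> 1 and TRUE -> 2) and use a character vector to map it to the desired output.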
library(data.table)
setDT(df)
setorder(df, id, date)
map <- c("Customer without x in first order", "Customer with X in first order")
df[, idx := 1L+any(indicator[1L]==1L), by=.(id)][,
First_Order_contains_x := map[idx]]
If the original row order matters, we can store it first with df[, rn := .I] and restore it afterwards with setorder(df, rn).
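A minimal sketch of that store-and-restore step, run on a hypothetical four-row table rather than the data from the question. The helper columns rn and idx are dropped at the end, which is an optional cleanup and not part of the answer itself.

```r
library(data.table)

# Hypothetical toy data, not the data set from the question
toy <- data.table(
  id        = c("b", "a", "b", "a"),
  date      = as.Date(c("2018-03-01", "2018-02-01", "2018-01-01", "2018-01-15")),
  indicator = c(0L, 1L, 1L, 0L)
)

toy[, rn := .I]            # remember the original row order
setorder(toy, id, date)    # earliest order first within each customer

map <- c("Customer without x in first order", "Customer with X in first order")
# 1L + FALSE gives index 1, 1L + TRUE gives index 2
toy[, idx := 1L + any(indicator[1L] == 1L), by = .(id)][,
    First_Order_contains_x := map[idx]]

setorder(toy, rn)                 # restore the original row order
toy[, c("rn", "idx") := NULL]     # optional: drop the helper columns
```

Note that indicator[1L] looks only at the first row after sorting, so a customer with several orders on the same earliest date is classified by whichever of those rows sorts first.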
Data:
set.seed(0L)
id <- round(rnorm(3000, mean = 5000, 5),0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <- rbinom(length(id), 1, 0.5)
df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)