I have a loop that has two problems: it is slow (it takes a very long time on a 30k-row dataset), and it does not produce the correct output.
library(dplyr)      # %>% and filter()
library(lubridate)  # days()

interval <- days(10)
df <- data.frame(CompanyID = c(23512, 51250, 12515, 51250, 52512, 51250),
                 openingDate = as.Date(c("1999-07-15", "1995-02-01",
                                         "2001-01-25", "1995-02-04", "2004-07-08", "1996-05-25")),
                 Rev = c(1000, 2000, 4000, 5000, 5500, 2050))
for(id in unique(df$CompanyID)){
  df[, c("groupID")] <- NA
  df[, c("updatedRev")] <- df$Rev
  tempDF <- df %>% filter(CompanyID == id)
  if(nrow(tempDF) == 1){
    # arbitrary unique ID
    df$groupID[df$CompanyID == id] <- paste(id)
  }
  else{
    dfDateRange <- tempDF[order(tempDF$openingDate),]
    while(nrow(dfDateRange) > 0) { # until table is empty
      earliest_date <- dfDateRange$openingDate[1] # earliest date within subset
      # all data within the interval
      within_dates <- dfDateRange %>% filter(openingDate <= earliest_date + interval)
      # values to replace in subset that falls within date range
      # Revenue
      df$updatedRev[(df$CompanyID == id) &
                      (df$openingDate %in% within_dates$openingDate)] <-
        sum(within_dates$Rev)
      # Generate Group ID
      df$groupID[(df$CompanyID == id) &
                   (df$openingDate %in% within_dates$openingDate)] <- paste(id)
      # moving on to rows outside of date range
      dfDateRange <- dfDateRange %>%
        filter(dfDateRange$openingDate > (earliest_date + interval))
    }
  }
}
The translation of this code is:
I have two questions about this. When I look at the data frame after running the loop, I get
> df
  CompanyID openingDate  Rev groupID updatedRev
1     23512  1999-07-15 1000    <NA>       1000
2     51250  1995-02-01 2000    <NA>       2000
3     12515  2001-01-25 4000    <NA>       4000
4     51250  1995-02-04 5000    <NA>       5000
5     52512  2004-07-08 5500   52512       5500
6     51250  1996-05-25 2050    <NA>       2050
when it should return:
  CompanyID openingDate  Rev groupID updatedRev
1     23512  1999-07-15 1000   23512       1000
2     51250  1995-02-01 2000   51250       7000
3     12515  2001-01-25 4000   12515       4000
4     51250  1995-02-04 5000   51250       7000
5     52512  2004-07-08 5500   52512       5500
6     51250  1996-05-25 2050   51250       2050
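*Note that updatedRev is not being updated: for CompanyID 51250, the rows that fall within the same 10-day range should have their values replaced by the sum of their Rev.
My other question: when I wrap this in a function, say test(), and run test(df), df does not change. Why is that?
Note: I will refine groupID later so that row 6 can be told apart from rows 2 and 4. I just wanted to ask the question now, since that is only a small adjustment.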
Answer 0 (score: 3)
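Assuming I have understood your problem statement correctly (unfortunately your code and its translation are quite hard to follow, see my comments below), the following tidyverse approach reproduces your expected output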
library(tidyverse)
df %>%
    rowid_to_column("row") %>%            # remember the original row order
    mutate(
        openingDate = as.Date(openingDate),
        groupID = CompanyID) %>%
    group_by(groupID) %>%
    arrange(openingDate) %>%
    mutate(
        # gap to the previous openingDate within the same company
        diff = openingDate - lag(openingDate),
        # start a new label whenever the gap is 10 days or more
        grp = cumsum(+!(is.na(diff) | diff < 10))) %>%
    group_by(groupID, grp) %>%
    mutate(updatedRev = sum(Rev)) %>%
    ungroup() %>%
    select(-diff, -grp) %>%
    arrange(row)                          # restore the original row order
## A tibble: 6 x 6
#    row CompanyID openingDate   Rev groupID updatedRev
#  <int>     <dbl> <date>      <dbl>   <dbl>      <dbl>
#1     1    23512. 1999-07-15  1000.  23512.      1000.
#2     2    51250. 1995-02-01  2000.  51250.      7000.
#3     3    12515. 2001-01-25  4000.  12515.      4000.
#4     4    51250. 1995-02-04  5000.  51250.      7000.
#5     5    52512. 2004-07-08  5500.  52512.      5500.
#6     6    51250. 1996-05-25  2050.  51250.      2050.
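Explanation: we group entries by groupID = CompanyID, sort by openingDate, calculate the differences between consecutive openingDates, and assign a label grp according to whether consecutive openingDates are within 10 days of each other; we then group by groupID and grp and create updatedRev as the sum of the Rev values in each group. The last steps tidy up and reorder the data.frame to reproduce your expected output.
I create the extra column row only to make it easier to compare the result with your expected output. Remove it if you do not need it.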
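The grp labelling step can be looked at in isolation. The following is a minimal base R sketch, not part of the pipeline above: it uses the three 51250 dates from the question, and the variable names (gap, grp, updated, chained) are only illustrative. The second half uses made-up dates to show one property of labelling by consecutive gaps.

```r
# Three opening dates for one company, already sorted (from the question's data)
dates   <- as.Date(c("1995-02-01", "1995-02-04", "1996-05-25"))
revenue <- c(2000, 5000, 2050)

# Days since the previous date; NA for the first row, like lag() in the pipeline
gap <- c(NA, as.numeric(diff(dates)))

# A new label starts whenever the gap is 10 days or more
grp <- cumsum(!(is.na(gap) | gap < 10))

# Sum revenue within each label and write the sum back to every row of that label
updated <- ave(revenue, grp, FUN = sum)

# Hypothetical dates 8 days apart: consecutive gaps chain them into one label,
# although the first and the last date are 16 days apart
chained     <- as.Date("2000-01-01") + c(0, 8, 16)
chained_gap <- c(NA, as.numeric(diff(chained)))
chained_grp <- cumsum(!(is.na(chained_gap) | chained_gap < 10))
```

Here grp labels the first two dates 0 and the third 1, so updated holds 7000 for the first two rows and 2050 for the last. In the chained example all three dates share one label, whereas the loop in the question measures 10 days from the earliest date of each window and would put the third date in a window of its own. On the sample data both rules agree, but it is worth checking which one is intended before running this on the full dataset.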
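Your code has multiple issues, readability being one of them, but most of them are related to the for and while loops. For example, I do not understand what you are trying to do in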
for(id in unique(df$CompanyID)){
df[, c("groupID")] <- NA
df[, c("updatedRev")] <- df$Rev
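Neither of these two assignments uses id, so they do not belong inside the for loop. As written, they reset groupID and updatedRev on every iteration, so only the last company processed keeps its values. You can simply do this once, before the loop: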
df$groupID <- NA
df$updatedRev <- df$Rev
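As a rough sketch of what the loop could look like with the initialisation moved out, here is a base R version. It is not the asker's exact code: it tracks row indices instead of matching dates with %in%, and it assumes interval can be a plain number of days (adding a number to a Date adds that many days) rather than a lubridate period.

```r
# Sample data from the question
df <- data.frame(
  CompanyID   = c(23512, 51250, 12515, 51250, 52512, 51250),
  openingDate = as.Date(c("1999-07-15", "1995-02-01", "2001-01-25",
                          "1995-02-04", "2004-07-08", "1996-05-25")),
  Rev         = c(1000, 2000, 4000, 5000, 5500, 2050)
)
interval <- 10  # days (assumption: plain number instead of lubridate::days(10))

# Initialise the new columns once, before the loop
df$groupID    <- NA_character_
df$updatedRev <- df$Rev

for (id in unique(df$CompanyID)) {
  # Row indices of this company, ordered by opening date
  rows <- which(df$CompanyID == id)
  rows <- rows[order(df$openingDate[rows])]

  while (length(rows) > 0) {  # until every row has been assigned to a window
    earliest  <- df$openingDate[rows[1]]
    in_window <- rows[df$openingDate[rows] <= earliest + interval]

    df$updatedRev[in_window] <- sum(df$Rev[in_window])
    df$groupID[in_window]    <- paste(id)

    # Move on to the rows outside the current window
    rows <- setdiff(rows, in_window)
  }
}
```

On the sample data this gives updatedRev 7000 for rows 2 and 4 and leaves the other rows at their own Rev, which matches the expected output in the question. It is still a row-by-row loop, so for 30k rows the grouped approach is likely to be the faster option.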
To get a unique groupID for observations that are >= 10 days apart, you can concatenate CompanyID and grp
df %>%
    rowid_to_column("row") %>%
    mutate(openingDate = as.Date(openingDate)) %>%
    group_by(CompanyID) %>%
    arrange(openingDate) %>%
    mutate(
        diff = openingDate - lag(openingDate),
        grp = cumsum(+!(is.na(diff) | diff < 10)),
        groupID = paste(CompanyID, grp, sep = "_")) %>%
    group_by(groupID, grp) %>%
    mutate(updatedRev = sum(Rev)) %>%
    ungroup() %>%
    select(-diff, -grp) %>%
    arrange(row)
## A tibble: 6 x 6
#    row CompanyID openingDate   Rev groupID updatedRev
#  <int>     <dbl> <date>      <dbl> <chr>        <dbl>
#1     1    23512. 1999-07-15  1000. 23512_0      1000.
#2     2    51250. 1995-02-01  2000. 51250_0      7000.
#3     3    12515. 2001-01-25  4000. 12515_0      4000.
#4     4    51250. 1995-02-04  5000. 51250_0      7000.
#5     5    52512. 2004-07-08  5500. 52512_0      5500.
#6     6    51250. 1996-05-25  2050. 51250_1      2050.
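On the second question (df not changing after test(df)): R functions work on a copy of their arguments, so changes made to a data frame inside a function are not visible outside unless the function returns the modified object and you assign the result. The same applies to the pipelines above, whose results also need to be assigned, e.g. df <- df %>% ... . A minimal sketch, with add_flag() as a purely hypothetical stand-in for test():

```r
# Hypothetical helper standing in for test(): adds a column and returns the result
add_flag <- function(data) {
  data$flag <- TRUE  # modifies the function's local copy only
  data               # return the modified copy to the caller
}

df_small <- data.frame(x = 1:3)

# Calling without assigning: the original data frame is left untouched
ignored <- add_flag(df_small)
unchanged <- !("flag" %in% names(df_small))

# Assigning the result back is what makes the change visible
df_small <- add_flag(df_small)
changed <- "flag" %in% names(df_small)
```

Both unchanged and changed end up TRUE here. So if test() ends by returning df, calling it as df <- test(df) should give the updated data frame.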