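I am trying to perform an iterative calculation on grouped data, where each step depends on the two previous elements within the group. As a toy example: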
library(data.table)  # data.table()
library(dplyr)       # %>%, group_by(), mutate()
set.seed(100)
df = data.table(ID = c(rep("A_index1",9)),
                Year = c(2001:2005, 2001:2004),
                Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                Index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))
R> df
ID Year Price Index
1: A_index1 2001 NA 1.5
2: A_index1 2002 NA 1.5
3: A_index1 2003 10 2.0
4: A_index1 2004 NA 1.0
5: A_index1 2005 NA 2.0
6: A_index1 2006 15 2.0
7: A_index1 2007 NA 3.0
8: A_index1 2008 13 1.5
9: A_index1 2009 NA 2.0
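The goal is to fill in the missing prices using the last available price, adjusted by the change in the index. I have a loop that performs these calculations, and I am trying to vectorize it using dplyr.
My logic is defined in the following loop: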
df$Price_adj = df$Price
for (i in 2:nrow(df)) {
  if (is.na(df$Price[i])) {
    df$Price_adj[i] = round(df$Price_adj[i-1] * df$Index[i] / df$Index[i-1], 2)
  }
}
R> df
ID Year Price Index Price_adj
1: A_index1 2001 NA 1.5 NA
2: A_index1 2002 NA 1.5 NA
3: A_index1 2003 10 2.0 10.00
4: A_index1 2004 NA 1.0 5.00
5: A_index1 2005 NA 2.0 10.00
6: A_index1 2006 15 2.0 15.00
7: A_index1 2007 NA 3.0 22.50
8: A_index1 2008 13 1.5 13.00
9: A_index1 2009 NA 2.0 17.33
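In my actual, much larger data I will have to apply this function to multiple groups, and speed is a consideration. My attempt is below; I would appreciate help pointing me in the right direction. I did consider Reduce, but I am not sure how it could incorporate the two previous elements in a group.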
foo = function(Price, Index){
  for (i in 2:nrow(df)) {
    if (is.na(df$Price[i])) {
      df$Price_adj[i] = df$Price_adj[i-1] * df$Index[i] / df$Index[i-1]
    }
  }
}
df %>%
  group_by(ID) %>%
  mutate(Price_adj = Price,
         Price_adj = foo(Price, Index))
Answer 0 (score: 3)
One option with cumprod:
df %>%
# group data frame into chunks starting from non na price
group_by(ID, g = cumsum(!is.na(Price))) %>%
# for each chunk multiply the first non na price with the cumprod of Index[i]/Index[i-1]
mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>%
ungroup() %>% select(-g)
# A tibble: 9 x 5
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2001 15 2.0 15.00
#7 A_index1 2002 NA 3.0 22.50
#8 A_index1 2003 13 1.5 13.00
#9 A_index1 2004 NA 2.0 17.33
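Grouping the data frame by ID and cumsum(!is.na(Price)) splits it into chunks. The latter increments at every non-NA price, so each chunk starts with a non-NA price (a leading run of NAs forms its own chunk, where first(Price) is NA and the result stays NA).
first(Price) * cumprod(Index / lag(Index, default=first(Index))) performs the iterative calculation. It is equivalent to the formula given in the question if you keep substituting Price_adj[i-1] with Price_adj[i-2] until you reach Price_adj[1], i.e. first(Price).
Caveat: this may not be very efficient if you have many NA chunks.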
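The substitution argument can be checked without dplyr. Inside a chunk the ratios Index[i] / Index[i-1] telescope, so the cumulative product reduces to Index[i] divided by the first Index of the chunk. The base R sketch below assumes the toy values printed in the question (typed in by hand rather than regenerated with sample); the names g, first_price, first_index and closed_form are made up for illustration.

```r
# Toy values copied from the printed table in the question
# (typed in by hand, so nothing depends on the random number generator).
price <- c(NA, NA, 10, NA, NA, 15, NA, 13, NA)
index <- c(1.5, 1.5, 2, 1, 2, 2, 3, 1.5, 2)

# Chunk id: increases by one at every observed (non-NA) price.
g <- cumsum(!is.na(price))

# First price and first index of each chunk, repeated along the chunk.
first_price <- ave(price, g, FUN = function(x) x[1])
first_index <- ave(index, g, FUN = function(x) x[1])

# cumprod(Index[i] / Index[i-1]) telescopes to Index[i] / (first Index of the chunk).
closed_form <- round(first_price * index / first_index, 2)
print(closed_form)
```

With several IDs, the ID vector would have to be passed to ave as an additional grouping argument, for example ave(price, id, g, FUN = ...). Rounding happens once at the end here, as in the answer; the loop in the question rounds at every step, which gives the same numbers on this toy data but could differ in the last digit on other inputs.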
If speed is the primary concern, you can write the function with the Rcpp package:
library(Rcpp)
cppFunction("
NumericVector price_adj(NumericVector price, NumericVector index) {
    int n = price.size();
    NumericVector adjusted_price(n);
    adjusted_price[0] = price[0];
    for (int i = 1; i < n; i++) {
        if(NumericVector::is_na(price[i])) {
            adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
        } else {
            adjusted_price[i] = price[i];
        }
    }
    return adjusted_price;
}")
Now use the cpp function together with dplyr, as follows:
cpp_fun <- function() df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj(Price, Index), 2))
cpp_fun()
# A tibble: 9 x 5
# Groups: ID [1]
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2001 15 2.0 15.00
#7 A_index1 2002 NA 3.0 22.50
#8 A_index1 2003 13 1.5 13.00
#9 A_index1 2004 NA 2.0 17.33
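If no C++ toolchain is available, the same recurrence can be written as a plain R loop over vectors and applied per group. This is only a sketch: price_adj_r is a hypothetical name, the two-group split of the toy data (IDs "A" and "B") is invented to show that the carry-over stops at a group boundary, and no claim is made about how its speed compares with the Rcpp version.

```r
# Plain R version of the recurrence; price_adj_r is a hypothetical helper name.
price_adj_r <- function(price, index) {
  adjusted <- price
  # seq_along(price)[-1] is empty for inputs of length 0 or 1, so the loop is skipped.
  for (i in seq_along(price)[-1]) {
    if (is.na(price[i])) {
      adjusted[i] <- adjusted[i - 1] * index[i] / index[i - 1]
    }
  }
  adjusted
}

# Toy values from the question, split into two made-up groups for illustration.
id    <- c(rep("A", 4), rep("B", 5))
price <- c(NA, NA, 10, NA, NA, 15, NA, 13, NA)
index <- c(1.5, 1.5, 2, 1, 2, 2, 3, 1.5, 2)

# Apply per group: split the row numbers by id, run the helper, write results back.
adjusted <- rep(NA_real_, length(price))
for (rows in split(seq_along(price), id)) {
  adjusted[rows] <- price_adj_r(price[rows], index[rows])
}
adjusted <- round(adjusted, 2)
print(adjusted)
```

Row 5 stays NA because it is the first row of group "B" and has no earlier price within its own group. Inside a dplyr pipeline the helper could presumably be used the same way as the compiled function, i.e. group_by(ID) followed by mutate(Price_adj = round(price_adj_r(Price, Index), 2)).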
Benchmark:
Define r_fun as:
r_fun <- function() df %>% group_by(ID, g = cumsum(!is.na(Price))) %>% mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>% ungroup() %>% select(-g)
Even on the small sample data there is already a difference:
microbenchmark::microbenchmark(r_fun(), cpp_fun())
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975 100
# cpp_fun() 3.191278 3.308758 3.738809 3.491495 3.937006 6.627019 100
Testing on a slightly larger data frame:
df <- bind_rows(rep(list(df), 10000))
#dim(df)
#[1] 90000 4
microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576 10
# cpp_fun() 8.722794 8.888667 10.67781 10.86399 12.10647 13.68302 10
Check that both functions return identical results:
identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE
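As for Reduce, which the question mentions: one way it could see both the previous adjusted price and the previous index is to fold over row positions instead of over values, so that the function can look up index[i] and index[i - 1] itself. The sketch below is for a single group and assumes the toy values printed in the question, typed in by hand. It is meant to illustrate the idea, not to suggest that it is faster than the approaches above.

```r
# Toy values from the question (one group).
price <- c(NA, NA, 10, NA, NA, 15, NA, 13, NA)
index <- c(1.5, 1.5, 2, 1, 2, 2, 3, 1.5, 2)

# Fold over positions 2..n; `prev` is the adjusted price of the previous row.
# With `init` supplied and accumulate = TRUE, Reduce returns init followed by
# one value per position, i.e. a vector as long as `price`.
adjusted <- Reduce(
  function(prev, i) {
    if (is.na(price[i])) prev * index[i] / index[i - 1] else price[i]
  },
  x = seq_along(price)[-1],
  init = price[1],
  accumulate = TRUE
)
adjusted <- round(adjusted, 2)
print(adjusted)
```

For grouped data this would need to be wrapped in a function of (price, index) and called once per group, in the same way as the other helpers.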