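I want to compute a time-weighted average in a vectorized way, using a specific weighting function.
I have worked out how to do this in a for loop (code below), but my dataset has roughly 100,000 rows and I would like a vectorized approach. I suspect this calls for something from the apply() family, but I have tried and failed to write a function that works inside apply(). I have also written a function that (I think?) vectorizes the solution, but I am not sure whether I did it correctly or efficiently, and I hope there is a better solution out there.
Edit: txhousing is available in the ggplot2 package.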
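To pin down what I mean by the time-weighted average, here is a minimal base R sketch on a toy vector. The helper name tw_mean_at and the toy weights are made up for illustration and are not part of my real code. It still loops over positions, so it only illustrates the definition, not the vectorized version I am after.

```r
# Hypothetical helper: weighted mean of the (up to length(w)) observations
# that come before position i. w[1] applies to the most recent earlier
# observation, w[2] to the one before that, and so on.
tw_mean_at <- function(x, i, w) {
  n_prev <- min(i - 1, length(w))
  if (n_prev < 1) return(NA_real_)        # no earlier observations
  prev <- x[(i - 1):(i - n_prev)]         # earlier values, most recent first
  weighted.mean(prev, w = w[seq_len(n_prev)], na.rm = TRUE)
}

x <- c(10, 20, 30, 40)   # toy series
w <- c(3, 2, 1)          # toy weights, largest for the most recent value

tw <- vapply(seq_along(x), function(i) tw_mean_at(x, i, w), numeric(1))
```

For position 3, for example, the earlier values are 20 and 10 with weights 3 and 2, so the result is (3 * 20 + 2 * 10) / 5 = 16.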
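library("tidyverse")

# Normal density with mean mu and variance theta_sq, used as the weighting function
normal_fn <- function(x, mu = 0, theta_sq = 1){
  y <- (1 / sqrt(2 * pi * theta_sq)) * exp(-((x - mu)^2) / (2 * theta_sq))
  return(y)
}

# Weights for the previous last_n observations; rank 1 is the most recent one
last_n <- 50
weights_df <- data.frame(weight = normal_fn(seq(0, 3, length.out = last_n)),
                         rank = seq(last_n))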
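
# One row per city and month; date_rank orders the months within each city.
# The data stays grouped by city, so summarise() in the loop below keeps the city column.
txhousing.mutated <- txhousing %>%
  filter(city %in% c("Austin", "Houston", "El Paso")) %>%
  mutate(date = lubridate::as_date(paste0(year, "-", month, "-01"))) %>%
  select(city, listings, date) %>%
  group_by(city) %>%
  arrange(date) %>%
  mutate(date_rank = rank(date))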

# for-loop version: one weighted mean per row, over up to last_n earlier rows of the same city
tw_txhousing.list <- vector(mode = "list", length = nrow(txhousing.mutated))
for(i in seq_len(nrow(txhousing.mutated))){
  # Earlier rows for this city; the most recent one gets this_date_rank 1
  txhousing.this <- txhousing.mutated %>%
    filter(city == txhousing.mutated[[i, "city"]] & date_rank < txhousing.mutated[[i, "date_rank"]]) %>%
    arrange(date_rank) %>%
    tail(last_n) %>%
    mutate(this_date_rank = rev(row_number()))
  tw_txhousing.list[[i]] <- txhousing.this %>%
    left_join(weights_df, by = c("this_date_rank" = "rank")) %>%
    summarise(tw_listings = weighted.mean(listings, w = weight, na.rm = TRUE)) %>%
    mutate(date = txhousing.mutated[[i, "date"]])
}

# Attach the per-row results to the original rows
tw_txhousing.df2 <- txhousing.mutated %>%
  left_join(data.table::rbindlist(tw_txhousing.list), by = c("city", "date"))
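
# Function version: the same computation for a single row, given its city (id) and date_rank (rank).
# It uses its arguments (data_df, weights, prev) rather than the global objects.
semi_vec <- function(data_df, weights, id, rank, prev){
  txhousing.this <- data_df %>%
    filter(city == id & date_rank < rank) %>%
    arrange(date_rank) %>%
    tail(prev) %>%
    mutate(this_date_rank = rev(row_number()))
  data.out <- txhousing.this %>%
    left_join(weights, by = c("this_date_rank" = "rank")) %>%
    summarise(tw_listings = weighted.mean(listings, w = weight, na.rm = TRUE)) %>%
    pull(tw_listings)
  # The first row of each city has no earlier rows, so the summary is empty
  if(length(data.out) < 1){
    data.out <- NA_real_
  }
  return(data.out)
}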

tw_txhousing.df <- txhousing.mutated %>%
  rowwise() %>%
  mutate(tw_listings = semi_vec(data_df = txhousing.mutated, weights = weights_df, id = city, rank = date_rank, prev = last_n)) %>%
  ungroup()

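I would like to get the same result as shown above in less time. I have used a simple example here, but it would be helpful if the function could return multiple outputs as a list, for example both weighted.mean(na.rm = FALSE) and weighted.mean(na.rm = TRUE), or as multiple columns, whichever is simplest.
Thanks!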
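To make the multiple-output request concrete, here is a minimal sketch of the shape I have in mind for a single window. The helper name tw_both, its output names, and the toy inputs are made up for illustration and are not part of the code above.

```r
# Hypothetical helper: both variants of the weighted mean for one window.
# prev holds the earlier values (most recent first), w the matching weights.
tw_both <- function(prev, w) {
  list(
    tw_na_rm   = weighted.mean(prev, w = w, na.rm = TRUE),
    tw_na_keep = weighted.mean(prev, w = w, na.rm = FALSE)
  )
}

res <- tw_both(prev = c(30, NA, 10), w = c(3, 2, 1))

# The list can be turned into a one-row data frame with one column per output
out <- as.data.frame(res)
```

With the NA removed, the first output is (3 * 30 + 1 * 10) / 4 = 25, while the second stays NA because the missing value is kept.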