Vectorized approach to a time-weighted average with a specified weighting function

Time: 2019-05-16 20:47:56

Tags: r

I want to compute a time-weighted average in a vectorized way, using a specific weighting function.

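I have worked out how to do this in a for loop (code below), but I am working with a dataset of roughly 100,000 rows and would like a vectorized approach. I suspect this needs something from the apply() family, but I have tried and failed to write a function that works inside apply(). I have also written a function that (I think?) vectorizes the solution, but I am not sure I did it correctly or efficiently, and I hope there is a better solution out there.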

Edit: txhousing is available in the ggplot2 package.

Building the dataset:

library("tidyverse")

normal_fn <- function(x, mu = 0, theta_sq = 1){
  y <- (1 / sqrt(2 * pi * theta_sq)) * exp((-1*((x - mu)^2)) / (2 * theta_sq))

  return(y)
}

last_n <- 50

weights_df <- data.frame(weight = normal_fn(seq(0, 3, length.out = last_n)),
                         rank = seq(last_n))

txhousing.mutated <- txhousing %>% 
  filter(city %in% c("Austin", "Houston", "El Paso")) %>% 
  mutate(date = lubridate::as_date(paste0(year, "-", month, "-01"))) %>% 
  select(city, listings, date) %>% 
  group_by(city) %>% 
  arrange(date) %>% 
  mutate(date_rank = rank(date))
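
As a side note on the weighting function: `normal_fn` as written appears to be the normal density with variance `theta_sq`, so it should agree with base R's `dnorm()` when `sd = sqrt(theta_sq)`. A minimal check, assuming nothing beyond base R (the test values `mu = 1`, `theta_sq = 4` are arbitrary):

```r
# normal_fn copied from the post, minus the intermediate variable
normal_fn <- function(x, mu = 0, theta_sq = 1) {
  (1 / sqrt(2 * pi * theta_sq)) * exp(-((x - mu)^2) / (2 * theta_sq))
}

x <- seq(0, 3, length.out = 50)

# Both comparisons should return TRUE
all.equal(normal_fn(x), dnorm(x))
all.equal(normal_fn(x, mu = 1, theta_sq = 4), dnorm(x, mean = 1, sd = 2))
```

If that holds, `weights_df$weight` could equally be built with `dnorm(seq(0, 3, length.out = last_n))`.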

Method 1: for loop

tw_txhousing.list <- vector(mode = "list", length = nrow(txhousing.mutated))
for(i in seq(nrow(txhousing.mutated))){

  txhousing.this <- txhousing.mutated %>% 
    filter(city == txhousing.mutated[[i, "city"]] & date_rank < txhousing.mutated[[i, "date_rank"]]) %>% 
    arrange(date_rank) %>% 
    tail(last_n) %>% 
    mutate(this_date_rank = rev(row_number()))

  tw_txhousing.list[[i]] <- txhousing.this %>% 
    left_join(weights_df, by = c("this_date_rank" = "rank")) %>% 
    summarise(tw_listings = weighted.mean(listings, w = weight, na.rm = TRUE)) %>% 
    mutate(date = txhousing.mutated[[i, "date"]])

}

tw_txhousing.df2 <- txhousing.mutated %>% 
  left_join(data.table::rbindlist(tw_txhousing.list), by = c("city", "date"))
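
Written out, what the loop seems to compute for each city is the following, where $x_t$ is `listings` at `date_rank` $t$, $n$ is `last_n`, and $\varphi$ is the standard normal density. The symbols are my own notation for this sketch, not something defined in the code:

```latex
\bar{x}_t = \frac{\sum_{k=1}^{m} w_k \, x_{t-k}}{\sum_{k=1}^{m} w_k},
\qquad m = \min(n,\; t-1),
\qquad w_k = \varphi\!\left(\frac{3\,(k-1)}{n-1}\right)
```

The most recent earlier observation ($k = 1$) gets the largest weight. With `na.rm = TRUE`, any term whose $x_{t-k}$ is missing is dropped from both sums, and the first row of each city ($m = 0$) has no value.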

Method 2: row-wise function

semi_vec <- function(data_df, weights, id, rank, prev){
  # Use the arguments rather than the globals, so the function is self-contained.
  # Keep the `prev` most recent rows of this city strictly before `rank`.
  data.this <- data_df %>%
    filter(city == id & date_rank < rank) %>%
    arrange(date_rank) %>%
    tail(prev) %>%
    mutate(this_date_rank = rev(row_number()))

  data.out <- data.this %>%
    left_join(weights, by = c("this_date_rank" = "rank")) %>%
    summarise(tw_listings = weighted.mean(listings, w = weight, na.rm = TRUE)) %>%
    pull(tw_listings)

  # No earlier rows (first date of a city): return NA instead of numeric(0)
  if(length(data.out) < 1){
    data.out <- NA_real_
  }

  return(data.out)
}

tw_txhousing.df <- txhousing.mutated %>%
  rowwise() %>% 
  mutate(tw_listings = semi_vec(data_df = txhousing.mutated, weights = weights_df, id = city, rank = date_rank, prev = last_n)) %>% 
  ungroup()

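I would like to get the same results as shown above in less time. I have used a simple example, but it would be helpful if the function allowed multiple outputs, either as a list (for example weighted.mean(na.rm = FALSE) as well as weighted.mean(na.rm = TRUE)) or as multiple columns, whichever is simplest.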
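
One direction that might work, sketched here in base R only and not benchmarked: instead of filtering the data once per row, loop over the lags (at most `last_n` iterations) and update every row of a city at once. `lagged_weighted_mean` is a hypothetical helper name, and it assumes `x` is already sorted by date within a single city with one row per date. Rows with no usable earlier values come back as `NA`.

```r
# Hypothetical helper: weighted mean of the previous length(w) values of x,
# where w[1] applies to the most recent earlier value.
lagged_weighted_mean <- function(x, w, na.rm = TRUE) {
  n      <- length(x)
  num    <- numeric(n)   # running sum of weight * value
  den    <- numeric(n)   # running sum of the weights actually used
  has_na <- logical(n)   # TRUE if any value in the row's window is NA
  for (k in seq_len(max(0, min(length(w), n - 1)))) {
    rows  <- (k + 1):n   # rows that have an observation k steps back
    x_lag <- x[rows - k]
    ok    <- !is.na(x_lag)
    num[rows[ok]] <- num[rows[ok]] + w[k] * x_lag[ok]
    den[rows[ok]] <- den[rows[ok]] + w[k]
    has_na[rows[!ok]] <- TRUE
  }
  out <- ifelse(den > 0, num / den, NA_real_)
  if (!na.rm) out[has_na] <- NA_real_
  out
}

# Small made-up example: three weights, one missing value
w <- c(0.5, 0.3, 0.2)
x <- c(1, 4, NA, 2, 8, 5, 7, 3)
lagged_weighted_mean(x, w, na.rm = TRUE)
lagged_weighted_mean(x, w, na.rm = FALSE)
```

On the data above this would presumably be called inside a grouped `mutate()`, e.g. `mutate(tw_listings = lagged_weighted_mean(listings, weights_df$weight))` on `txhousing.mutated` (already grouped by `city` and arranged by `date`). Adding a second column with `na.rm = FALSE` in the same `mutate()` would cover the multiple-output case. Results may differ from the loop in the last few floating-point digits because the sums are accumulated in a different order.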

Thanks!

0 answers:

No answers yet.