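I'm struggling to find a faster way to compute the median and mean of a large vector in R. How could I implement a faster approach? I'm using the code below, but it is too slow. I'm considering parallel processing, but I don't know how to go about it. Thanks.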
vector <- 1:10000000000
m <- mean(vector)
md <- median(vector)
Answer 0 (score: 0)
Assuming we are dealing with a sequential integer vector 1:n, this may help:
## Given
V <- 1:10e8
n <- length(V)
## To get the median (V is sorted, so only the middle element(s) are needed)
med <- if (n %% 2 == 0) mean(V[(n/2):((n/2) + 1)]) else V[(n + 1)/2]
med
OUTPUT: 5e+08
## To get the mean,
sum_series <- n*(n + 1) / 2 # closed form for 1 + 2 + ... + n
mu <- sum_series / n # equals (n + 1)/2
mu
OUTPUT: 5e+08
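Since the mean and the median of a consecutive run of integers both equal the midpoint of its endpoints, the vector never has to be built at all. A minimal sketch, assuming the data really is a sequence a:b; the helper name seq_center is made up for illustration and is not part of base R:

```r
# Hypothetical helper: mean (and median) of the integer sequence a:b,
# computed from the two endpoints only, so no vector is allocated.
seq_center <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), a <= b)
  (a + b) / 2
}

seq_center(1, 10)    # same value as mean(1:10) and median(1:10)
seq_center(1, 1e10)  # the size used in the question, computed without building the vector
```

This only applies when the endpoints are known; for arbitrary data the vector still has to be read at least once or sampled.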
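For a large random vector, the positional median above still works, provided the vector is sorted. If there is no closed-form formula for the mean, you can estimate it: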
### Estimation via Repeated Sampling ###
est_mean <- function(V, k, size) {
  # k: number of sample means used in the estimate
  # size: sample size behind each sample mean
  est <- rep(NA, k)
  samp <- matrix(NA, nrow = size, ncol = k)
  # Draw k samples with replacement, one per column
  for (j in 1:k) samp[, j] <- sample(V, size, replace = TRUE)
  # Mean of each sample
  for (j in 1:k) est[j] <- mean(samp[, j])
  # Return the median of the k sample means
  est <- sort(est)
  return(est[ceiling(length(est)/2)])
}
### Time Complexity of Estimation ###
# Allocate samp + est = k*size + k
# If size, k ~ 30 --> usually enough for the sample means to be roughly normal
# k iterations * (draw a sample of `size` + store it) = k*(size + size)
# --> 2*k*size
# Total = k + 3*k*size --> independent of N = length(V)
### Time Complexity of Base R mean() ###
# Assuming it is roughly: mean(V) = sum(V)/length(V)
# sum N items + find length + 1 division + 1 return = N + 3, i.e. O(N)
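A cost that does not depend on N comes at the price of sampling error, so it can help to report how uncertain the estimate is. The sketch below is a simpler variant than est_mean, not the same method: it draws one sample and returns its mean together with the usual standard error sd/sqrt(size). The name est_mean_se and the demo vector x are assumptions made for illustration.

```r
# Hypothetical variant: a single sample mean plus its standard error.
# This replaces the median-of-k-means used in est_mean with one larger sample.
est_mean_se <- function(V, size) {
  samp <- sample(V, size, replace = TRUE)
  list(estimate = mean(samp),
       se = sd(samp) / sqrt(size))  # standard error of the sample mean
}

set.seed(0)
x <- runif(1e6)                 # demo data, assumed for this sketch
res <- est_mean_se(x, 30000)    # 30000 draws = 1000 * 30, as in est_mean(V, 1000, 30)
res$estimate                    # estimated mean
res$se                          # rough size of the sampling error
mean(x)                         # exact mean, for comparison
```

Quadrupling size roughly halves the standard error, which gives a direct way to trade running time for accuracy.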
### Example ###
set.seed(0)
V <- sort(sample(0:10e8, 10e7, replace = TRUE))
start1 <- Sys.time()
est_mu <- est_mean(V, 1000, 30)
end1 <- Sys.time()
diff1 <- end1 - start1
start2 <- Sys.time()
r_mu <- mean(V)
end2 <- Sys.time()
diff2 <- end2 - start2
diff1
OUTPUT: Time difference of 0.08370018 secs
diff2
OUTPUT: Time difference of 0.5321879 secs
print(paste("Relative difference =", abs(r_mu - est_mu)/r_mu))
OUTPUT: "Relative difference = 0.00678363793285072"
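The example vector is sorted, so its median can be read off by position, as in the 1:n case, without scanning the data. A minimal sketch, assuming the input is already sorted in ascending order; median_sorted is a made-up name, and the sortedness is deliberately not checked because the check itself would cost O(N):

```r
# Hypothetical helper: median of a vector that is ALREADY sorted.
# Only the one or two middle elements are touched.
median_sorted <- function(V) {
  n <- length(V)
  stopifnot(n > 0)
  mid <- n %/% 2
  if (n %% 2 == 0) {
    # as.numeric() avoids integer overflow when adding two large integers
    (as.numeric(V[mid]) + V[mid + 1]) / 2
  } else {
    V[mid + 1]
  }
}

s <- sort(c(9, 1, 5, 3))
median_sorted(s)   # agrees with median(s)
```

If the vector is not sorted, sorting it first costs more than calling median() directly, so this shortcut only pays off when the order comes for free.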