My data looks like this:
df <- data.frame(
x = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish", "squid", "squid", "squid"),
y = c(10, 11, 6, 3, 4, 5, 5, 9, 14, 33, 16)
)
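I want to iterate over the data and grab one value for each animal in some "include/filter" list, then add those values together.
For example, maybe I only care about dogs, cats, and fish.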
animals <- c("dog", "cat", "fish")
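In resample 1, I might get 10, 4, 9 (sum = 23), and in resample 2, I might get 6, 3, 5 (sum = 14).
I just whipped up a really clunky replicate/for function that leans on dplyr, but it seems super inefficient: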
library(dplyr)

ani_samp <- function(animals) {
  total <- 0
  for (i in animals) {
    # Draw one random row for animal i and extract its y value
    v <- df %>%
      filter(x == i) %>%
      sample_n(1) %>%
      select(y) %>%
      as.numeric()
    total <- total + v
  }
  return(total)
}

replicate(1000, ani_samp(animals))
How can I improve this resampling/pseudo-bootstrap code?
Answer 0 (score: 3)
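I'm not sure whether this is better (I had no time to benchmark), but you can avoid the double loop here. You can first filter by animals (and so work only on a subset), and then draw n samples from each group just once. If you like dplyr, here is a possible dplyr/tidyr version: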
library(tidyr)
library(dplyr)
ani_samp <- function(animals, n){
  df %>%
    filter(x %in% animals) %>% # Work on a subset
    group_by(x) %>%
    sample_n(n, replace = TRUE) %>% # sample only once per each group
    group_by(x) %>%
    mutate(id = row_number()) %>% # Create an index for rowSums
    spread(x, y) %>% # Convert to wide format for rowSums
    mutate(res = rowSums(.[-1])) %>% # Sum everything at once
    .$res # You don't need this if you want a data.frame result instead
}
set.seed(123) # For reproducible output
ani_samp(animals, 10)
# [1] 18 24 14 24 19 18 19 19 19 14
Answer 1 (score: 1)
Another approach is:
set.seed(123) ## for reproducibility
n <- 1000 ## number of samples for each animal
samps <- do.call(cbind, lapply(animals, function(x) {sample(df$y[df$x == x], n, replace=TRUE)}))
head(samps, 10)
##       [,1] [,2] [,3]
##  [1,]   10    3    5
##  [2,]    6    4    5
##  [3,]   11    3    5
##  [4,]    6    4    5
##  [5,]    6    4    5
##  [6,]   10    3    5
##  [7,]   11    4    5
##  [8,]    6    3    5
##  [9,]   11    3    5
## [10,]   11    3    5
sum <- as.vector(samps %*% rep(1,length(animals)))
head(sum, 10)
## [1] 18 15 19 15 15 18 20 14 19 19
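Here, we use lapply to loop over animals and, for each animal, use sample to draw 1000 samples with replacement from the df$y values whose df$x matches that animal. We then cbind the results together so that each row of samps is one resample across the animals. The last line computes the row sums using matrix multiplication.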
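As a small aside, multiplying by a vector of ones is just one way to take row sums. The sketch below rebuilds the question's toy data and checks that base R's rowSums gives the same totals. The names totals_matmul and totals_rowsums are introduced here only for illustration and are not part of the original answer.

```r
# Toy data copied from the question
df <- data.frame(
  x = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish",
        "squid", "squid", "squid"),
  y = c(10, 11, 6, 3, 4, 5, 5, 9, 14, 33, 16)
)
animals <- c("dog", "cat", "fish")

set.seed(123)  # for reproducibility
n <- 1000      # number of resamples

# One column per animal, each holding n draws with replacement
samps <- do.call(cbind, lapply(animals, function(a) {
  sample(df$y[df$x == a], n, replace = TRUE)
}))

# Row sums via matrix multiplication (as in the answer) and via rowSums
totals_matmul  <- as.vector(samps %*% rep(1, length(animals)))
totals_rowsums <- rowSums(samps)

# Both approaches should agree on every resample
stopifnot(isTRUE(all.equal(totals_matmul, totals_rowsums)))
```

Whether rowSums is faster than the matrix product on data this small has not been benchmarked here. It mainly states the intent more directly.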
According to system.time, drawing 1000 samples for each animal is almost instantaneous:
n <- 1000 ## number of samples for each animal
system.time(as.vector(do.call(cbind, lapply(animals, function(x) {sample(df$y[df$x == x], n, replace=TRUE)})) %*% rep(1,length(animals))))
##    user  system elapsed
##   0.001   0.000   0.001
This should also scale well with the number of samples n.