I have this data frame:
r2 distance
1 33.64 67866
2 8.50 77229
3 15.07 109119
4 24.35 142279
5 7.74 143393
6 8.21 177670
7 12.26 216440
8 12.66 253751
9 26.31 282556
10 39.08 320816
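I need to compute the mean of column r2 for each block of rows in which the difference between two values of column distance is less than or equal to 100000.
For this example, the desired output would be: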
mean_r2 diff_of_distance
1 17.86 75527 ## mean of rows 1 to 5; distance 5 - distance 1
2 13.91 66164 ## mean of rows 2 to 5; distance 5 - distance 2
3 13.84 68551 ## mean of rows 3 to 6; distance 6 - distance 3
4 13.14 74161 ## mean of rows 4 to 7; distance 7 - distance 4
5 9.40 73047 ## mean of rows 5 to 7; distance 7 - distance 5
6 11.04 76081 ## mean of rows 6 to 8; distance 8 - distance 6
And so on.
Edit 1: I have more than 100,000 rows.
Thanks.
Answer 0 (score: 0)
You can try:
# your data
d <- read.table(text="r2 distance
1 33.64 67866
2 8.50 77229
3 15.07 109119
4 24.35 142279
5 7.74 143393
6 8.21 177670
7 12.26 216440
8 12.66 253751
9 26.31 282556
10 39.08 320816", header=T)
library(tidyverse) #dplyr_0.7.2
d %>%
mutate(index=1:n()) %>% # add row index
group_by(index) %>% # group by this index
# calculate difference and find max row where diff < 100000
mutate(max_row=max(which(.$distance - distance < 100000, arr.ind=T))) %>%
# calculate mean
mutate(mean_r2=mean(.$r2[index:max_row])) %>%
# calculate the difference
mutate(diff_of_distance=.$distance[max_row] - .$distance[index]) %>%
# unite the columns
unite(rows, index, max_row, sep = "-")
# A tibble: 10 x 5
r2 distance rows mean_r2 diff_of_distance
* <dbl> <int> <chr> <dbl> <int>
1 33.64 67866 1-5 17.860000 75527
2 8.50 77229 2-5 13.915000 66164
3 15.07 109119 3-6 13.842500 68551
4 24.35 142279 4-7 13.140000 74161
5 7.74 143393 5-7 9.403333 73047
6 8.21 177670 6-8 11.043333 76081
7 12.26 216440 7-9 17.076667 66116
8 12.66 253751 8-10 26.016667 67065
9 26.31 282556 9-10 32.695000 38260
10 39.08 320816 10-10 39.080000 0
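This works because group_by subsets the data frame, so inside mutate the bare name distance refers to the value belonging to the current group (here a single row). .$distance, in contrast, always refers to the full column regardless of group_by(), so the difference is computed against the complete vector.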
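As a minimal sketch of what the pipeline does for one row, the same steps can be written in base R. The data frame d is rebuilt here so the snippet runs on its own; i, in_range, max_row, block_mean and block_diff are illustrative names only and are not part of the answer.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
d <- data.frame(
  r2 = c(33.64, 8.50, 15.07, 24.35, 7.74, 8.21, 12.26, 12.66, 26.31, 39.08),
  distance = c(67866, 77229, 109119, 142279, 143393,
               177670, 216440, 253751, 282556, 320816)
)

i <- 1                                               # row that starts the block
in_range <- d$distance - d$distance[i] < 100000      # compared against the full column
max_row <- max(which(in_range))                      # last row still within 100000
block_mean <- mean(d$r2[i:max_row])                  # mean of r2 over rows i..max_row
block_diff <- d$distance[max_row] - d$distance[i]    # span of the block

max_row     # 5
block_mean  # 17.86
block_diff  # 75527
```

Inside the grouped mutate, d$distance[i] corresponds to the bare distance and d$distance corresponds to .$distance.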
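Answer 1 (score: 0)
Loop over each value of distance, subtract it from every value in the distance vector, and test whether the result is less than 100000. This creates a boolean vector. Because distance is sorted in ascending order, every row up to and including the last one within range is TRUE, so the sum of the vector is the index of the last row before the difference reaches 100000 (i.e. before the bool becomes FALSE). Use this index to identify your block, then take the mean of r2 within each block.
To speed up the code, define the type and length of your vectors up front (this avoids "growing a vector" on every iteration).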
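# df is the data frame from the question (columns r2 and distance)
means <- vector("numeric", length = nrow(df))
rows <- vector("character", length = nrow(df)) # holds labels such as "1 - 5"
distance_diff <- vector("numeric", length = nrow(df))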
for (i in seq_along(df$distance)) {
dis_val <- df$distance[i] # the ith distance value
bools <- (df$distance - dis_val) < 100000 # TRUE where the difference from the ith value is less than 100000
block_range <- sum(bools) # index of the last row within range (relies on distance being sorted ascending)
rows[i] <- paste(as.character(i), "-", as.character(block_range))
means[i] <- mean(df$r2[i:block_range]) # mean of r2 over rows i to block_range
distance_diff[i] <- df$distance[block_range] - dis_val # distance of the last row in the block minus the ith distance
}
data.frame(mean_r2 = means, rows= rows, diff_of_distance=distance_diff)
mean_r2 rows diff_of_distance
1 17.860000 1 - 5 75527
2 13.915000 2 - 5 66164
3 13.842500 3 - 6 68551
4 13.140000 4 - 7 74161
5 9.403333 5 - 7 73047
6 11.043333 6 - 8 76081
7 17.076667 7 - 9 66116
8 26.016667 8 - 10 67065
9 32.695000 9 - 10 38260
10 39.080000 10 - 10 0
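With the 100,000+ rows mentioned in the question, comparing every row against the full vector costs on the order of n^2 comparisons. The following is a sketch of a vectorised alternative, not part of either answer. It assumes distance is sorted in ascending order and uses base R's findInterval() and cumsum(); block_means and max_gap are hypothetical names introduced for illustration.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
d <- data.frame(
  r2 = c(33.64, 8.50, 15.07, 24.35, 7.74, 8.21, 12.26, 12.66, 26.31, 39.08),
  distance = c(67866, 77229, 109119, 142279, 143393,
               177670, 216440, 253751, 282556, 320816)
)

# Hypothetical helper: block means without an explicit loop.
# Assumes distance is sorted ascending and max_gap > 0.
block_means <- function(r2, distance, max_gap = 100000) {
  stopifnot(length(r2) == length(distance), !is.unsorted(distance))
  first <- seq_along(distance)
  # With left.open = TRUE, findInterval() returns, for each x, the number of
  # elements of the sorted vector strictly less than x, i.e. the last row j
  # with distance[j] - distance[i] < max_gap
  last <- findInterval(distance + max_gap, distance, left.open = TRUE)
  # Prefix sums: sum(r2[i:j]) equals cs[j + 1] - cs[i]
  cs <- c(0, cumsum(r2))
  data.frame(
    rows = paste(first, last, sep = "-"),
    mean_r2 = (cs[last + 1] - cs[first]) / (last - first + 1),
    diff_of_distance = distance[last] - distance[first]
  )
}

res <- block_means(d$r2, d$distance)
res
```

The block boundaries match those in the outputs above (1-5, 2-5, 3-6, and so on). Since the means come from differences of cumulative sums, they can differ from mean() in the last few decimal places on very long vectors.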