Get the mean of a column for each block of n rows based on a condition

Asked: 2017-09-29 14:38:02

Tags: r loops dataframe aggregate apply

I have this data frame:

       r2 distance
1   33.64    67866
2    8.50    77229
3   15.07   109119
4   24.35   142279 
5    7.74   143393
6    8.21   177670
7   12.26   216440
8   12.66   253751
9   26.31   282556
10  39.08   320816

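I need to compute the mean of column r2 for each block of rows in which the difference between two values in column distance is equal to or less than 100000. For this example, the desired output would be: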

  mean_r2 diff_of_distance
1   17.86            75527 ## mean of rows 1 to 5; distance 5 - distance 1
2   13.91            66164 ## mean of rows 2 to 5; distance 5 - distance 2
3   13.84            68551 ## mean of rows 3 to 6; distance 6 - distance 3
4   13.14            74161 ## mean of rows 4 to 7; distance 7 - distance 4
5    9.40            73047 ## mean of rows 5 to 7; distance 7 - distance 5
6   11.04            76081 ## mean of rows 6 to 8; distance 8 - distance 6

And so on.

Edit 1: I have more than 100,000 rows.

Thanks.

2 answers:

Answer 0 (score: 0)

You can try:

# your data
d <- read.table(text="r2 distance
1   33.64    67866
           2    8.50    77229
           3   15.07   109119
           4   24.35   142279 
           5    7.74   143393
           6    8.21   177670
           7   12.26   216440
           8   12.66   253751
           9   26.31   282556
           10  39.08   320816", header=T)

library(tidyverse) #dplyr_0.7.2
d %>%
  mutate(index=1:n()) %>% # add row index
  group_by(index) %>% # group by this index
  # calculate difference and find max row where diff < 100000
  mutate(max_row=max(which(.$distance - distance < 100000, arr.ind=T))) %>% 
  # calculate mean
  mutate(mean_r2=mean(.$r2[index:max_row])) %>% 
  # calculate the difference
  mutate(diff_of_distance=.$distance[max_row] - .$distance[index]) %>% 
  # unite the columns 
  unite(rows, index, max_row, sep = "-")

# A tibble: 10 x 5
      r2 distance  rows   mean_r2 diff_of_distance
 * <dbl>    <int> <chr>     <dbl>            <int>
 1 33.64    67866   1-5 17.860000            75527
 2  8.50    77229   2-5 13.915000            66164
 3 15.07   109119   3-6 13.842500            68551
 4 24.35   142279   4-7 13.140000            74161
 5  7.74   143393   5-7  9.403333            73047
 6  8.21   177670   6-8 11.043333            76081
 7 12.26   216440   7-9 17.076667            66116
 8 12.66   253751  8-10 26.016667            67065
 9 26.31   282556  9-10 32.695000            38260
10 39.08   320816 10-10 39.080000                0

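This works because group_by() splits the data frame into groups, so inside mutate() the bare name distance refers to the current group's value (a single row here), while .$distance always returns the full column regardless of group_by(). Subtracting the two gives the difference between the current row and every row in the data.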
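To make the per-group step concrete, here is a minimal base R sketch of what the pipeline computes for one row (row 3), without dplyr. The names `in_range`, `last_row`, `block_mean` and `block_diff` are illustrative only and do not come from the answer.

```r
# Data from the question
r2 <- c(33.64, 8.50, 15.07, 24.35, 7.74, 8.21, 12.26, 12.66, 26.31, 39.08)
distance <- c(67866, 77229, 109119, 142279, 143393,
              177670, 216440, 253751, 282556, 320816)

i <- 3  # the "group" being processed

# Full column minus the current row's value, as in `.$distance - distance`
in_range <- distance - distance[i] < 100000

# Last row whose distance is still within 100000 of row i
last_row <- max(which(in_range))

# Mean of r2 over the block, and the distance spanned by the block
block_mean <- mean(r2[i:last_row])
block_diff <- distance[last_row] - distance[i]

c(last_row = last_row, mean_r2 = block_mean, diff_of_distance = block_diff)
```

This should reproduce the `3-6` row of the output above (mean 13.8425, difference 68551).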

Answer 1 (score: 0)

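Loop over each value of distance, subtract it from every value in the distance vector, and test whether the result is less than 100000. This creates a boolean vector whose sum gives the index of the last row before the difference reaches 100000 (i.e. where the booleans turn FALSE). The count equals that index only because distance is sorted in ascending order, as in the example, so every earlier row is also TRUE. Use this index to identify your block, then take the mean of r2 within each block. In the code, df is the data frame from the question.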
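The counting step described above can be checked in isolation. The sketch below compares `sum(bools)` with `max(which(bools))` on the question's distances, then on a reordered copy; the reordering is a made-up illustration (not from the thread) of why the data should be sorted by distance first.

```r
distance <- c(67866, 77229, 109119, 142279, 143393,
              177670, 216440, 253751, 282556, 320816)

# Sorted input: the number of TRUEs equals the index of the last TRUE
bools_sorted <- (distance - distance[1]) < 100000
n_within_sorted <- sum(bools_sorted)
last_within_sorted <- max(which(bools_sorted))

# Hypothetical reordering: the largest distance is moved to position 2
shuffled <- distance[c(1, 10, 2:9)]
bools_shuffled <- (shuffled - shuffled[1]) < 100000
n_within_shuffled <- sum(bools_shuffled)
last_within_shuffled <- max(which(bools_shuffled))

# On unsorted input the two disagree, and a block built from the count
# would include a row that is more than 100000 away
c(sorted_sum = n_within_sorted, sorted_last = last_within_sorted,
  shuffled_sum = n_within_shuffled, shuffled_last = last_within_shuffled)
```

If the real data are not already ordered, sorting with something like `df <- df[order(df$distance), ]` before the loop avoids the problem.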

To speed up the code, define the type and length of your vectors up front (this avoids "growing a vector" on every iteration).

means <- vector("numeric", length = nrow(df))
rows <- vector("character", length = nrow(df)) # holds labels such as "1 - 5"
distance_diff <- vector("numeric", length = nrow(df))

for (i in seq_along(df$distance)) {

  dis_val <- df$distance[i] # the ith distance value
  bools <- (df$distance - dis_val) < 100000 # bool indicating if difference between i and every value in vector is less than 100000
  block_range <- sum(bools)# taking sum of bools identifies the value at which the distance becomes > 100000
  rows[i] <- paste(as.character(i), "-", as.character(block_range)) 
  means[i] <- mean(df$r2[i:block_range]) # take the mean of r2 in the range i to all rows where distance is < 100000
  distance_diff[i] <- df$distance[block_range] - dis_val # minus the distance from the value before distance is > 100000 from i

}

data.frame(mean_r2 = means, rows= rows, diff_of_distance=distance_diff)

     mean_r2    rows diff_of_distance
1  17.860000   1 - 5            75527
2  13.915000   2 - 5            66164
3  13.842500   3 - 6            68551
4  13.140000   4 - 7            74161
5   9.403333   5 - 7            73047
6  11.043333   6 - 8            76081
7  17.076667   7 - 9            66116
8  26.016667  8 - 10            67065
9  32.695000  9 - 10            38260
10 39.080000 10 - 10                0
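Both answers compare each row against the full distance column, which grows quadratically with the number of rows. For the 100,000+ rows mentioned in the question, one possible alternative is sketched below. It is not taken from either answer: it assumes distance is sorted ascending, uses `findInterval()` to locate the end of each block, and uses cumulative sums to get the block means. The function name `window_means` and the argument `max_gap` are hypothetical. It applies the question's "equal to or less than 100000" condition, whereas the answers use a strict `<`; on the sample data both give the same blocks.

```r
d <- data.frame(
  r2 = c(33.64, 8.50, 15.07, 24.35, 7.74, 8.21, 12.26, 12.66, 26.31, 39.08),
  distance = c(67866, 77229, 109119, 142279, 143393,
               177670, 216440, 253751, 282556, 320816)
)
d <- d[order(d$distance), ]  # findInterval() needs a sorted vector

# Hypothetical helper: one block per row, ending at the last row whose
# distance is at most `max_gap` beyond the starting row
window_means <- function(r2, distance, max_gap = 100000) {
  stopifnot(!is.unsorted(distance))
  first <- seq_along(distance)
  # For each row, the largest index j with distance[j] <= distance[i] + max_gap
  last <- findInterval(distance + max_gap, distance)
  # Prefix sums let each block sum be computed as a difference of two values
  cs <- c(0, cumsum(r2))
  data.frame(
    first = first,
    last = last,
    mean_r2 = (cs[last + 1] - cs[first]) / (last - first + 1),
    diff_of_distance = distance[last] - distance[first]
  )
}

res <- window_means(d$r2, d$distance)
res
```

The `mean_r2` and `diff_of_distance` columns should agree with the tables above, up to floating-point rounding from the cumulative sums.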