Select two random and consecutive rows from grouped data

Asked: 2018-09-27 22:40:30

Tags: r dplyr

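In the data below (included as dput output), I have repeated observations (latitude and longitude) for three individuals (IndIDII). Note that each individual has a different number of locations, and that the rows are ordered by IndYear.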

  IndIDII      IndYear  WintLat  WintLong
1 BHS_265 BHS_265-2015 47.61025 -112.7210
2 BHS_265 BHS_265-2016 47.59884 -112.7089
3 BHS_770 BHS_770-2016 42.97379 -109.0400
4 BHS_770 BHS_770-2017 42.97129 -109.0367
5 BHS_770 BHS_770-2018 42.97244 -109.0509
6 BHS_377 BHS_377-2015 43.34744 -109.4821
7 BHS_377 BHS_377-2016 43.35559 -109.4445
8 BHS_377 BHS_377-2017 43.35195 -109.4566
9 BHS_377 BHS_377-2018 43.34765 -109.4892

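I would like to filter the data to create a new df that has two consecutive rows for each IndIDII. In my larger dataset, every individual has at least 2 observations (i.e. rows), and each has between 2 and 4 observations. Obviously, for individuals with only two rows, the code will return those two rows. If there are more rows, one of the pairs 1 and 2, 2 and 3, or 3 and 4 should be selected at random. The order of the rows does not matter as long as they are consecutive (i.e. either 3 and 4 or 4 and 3 can be returned).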

As always, many thanks!

Dat <- structure(list(IndIDII = c("BHS_265", "BHS_265", "BHS_770", "BHS_770", 
"BHS_770", "BHS_377", "BHS_377", "BHS_377", "BHS_377"), IndYear = c("BHS_265-2015", 
"BHS_265-2016", "BHS_770-2016", "BHS_770-2017", "BHS_770-2018", 
"BHS_377-2015", "BHS_377-2016", "BHS_377-2017", "BHS_377-2018"
), WintLat = c(47.6102519805014, 47.5988417247191, 42.9737859090909, 
42.9712914772727, 42.9724390816327, 43.3474354347826, 43.3555934579439, 
43.3519543396226, 43.3476466990291), WintLong = c(-112.720994832869, 
-112.708887595506, -109.039964727273, -109.036693522727, -109.050923061224, 
-109.482114456522, -109.444522149533, -109.45659254717, -109.489241553398
)), class = "data.frame", row.names = c(NA, -9L))

3 answers:

Answer 0: (score: 2)

Here is a solution using base R functions:

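set.seed(505) # any seed works; 505 is used here for reproducibility
lapply(split(Dat, Dat$IndIDII), function(x) {
  ind <- sample(nrow(x))  # random permutation of the group's row indices
  # ind[1] is a uniformly drawn row: pair it with the next row,
  # or with the previous row when it is the last row of the group.
  # Note: in groups with more than two rows, this makes the last pair
  # twice as likely as each of the other pairs.
  cons <- if (ind[1] < max(ind)) {
    c(ind[1], ind[1] + 1)
  } else {
    c(ind[1], ind[1] - 1)
  }
  x[cons, ]
})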

$`BHS_265`
  IndIDII      IndYear  WintLat  WintLong
1 BHS_265 BHS_265-2015 47.61025 -112.7210
2 BHS_265 BHS_265-2016 47.59884 -112.7089

$BHS_377
  IndIDII      IndYear  WintLat  WintLong
6 BHS_377 BHS_377-2015 43.34744 -109.4821
7 BHS_377 BHS_377-2016 43.35559 -109.4445

$BHS_770
  IndIDII      IndYear  WintLat  WintLong
3 BHS_770 BHS_770-2016 42.97379 -109.0400
4 BHS_770 BHS_770-2017 42.97129 -109.0367
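
If every consecutive pair should be equally likely, one option is to draw the starting row from the first n - 1 rows of each group and keep that row plus the next one. The following is only a sketch: `pick_pair` and the `toy` data frame are hypothetical names introduced for illustration, not part of the answer above, and the sketch assumes every group has at least two rows.

```r
# Hypothetical helper: choose a start row uniformly among the n - 1
# possible starts, then return that row and the one after it.
pick_pair <- function(x) {
  start <- sample.int(nrow(x) - 1L, 1L)
  x[start + 0:1, ]
}

# Toy data with groups of 2, 3 and 4 rows (illustrative only).
toy <- data.frame(
  id  = rep(c("a", "b", "c"), times = c(2, 3, 4)),
  obs = c(1:2, 1:3, 1:4)
)

res <- do.call(rbind, lapply(split(toy, toy$id), pick_pair))
res
```

`sample.int(n, 1)` always samples from `1:n`, so a two-row group returns its only pair.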

Answer 1: (score: 2)

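You can use ave. Within each group, create a row index (i <- seq_along(x)). To get the first row index to keep, sample one value from all row indices except the last (sample(head(i, -1), 1)). Also include the next row (+ 0:1). Check which row indices are among the sampled rows (i %in% ...). Because ave returns a character vector here, coerce the result back to logical and use it to subset the data.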

Dat[as.logical(ave(Dat$IndIDII, Dat$IndIDII, FUN = function(x){
  i <- seq_along(x)
  i %in% (sample(head(i, -1), 1) + 0:1)
})), ]

#   IndIDII      IndYear  WintLat  WintLong
# 1 BHS_265 BHS_265-2015 47.61025 -112.7210
# 2 BHS_265 BHS_265-2016 47.59884 -112.7089
# 4 BHS_770 BHS_770-2017 42.97129 -109.0367
# 5 BHS_770 BHS_770-2018 42.97244 -109.0509
# 7 BHS_377 BHS_377-2016 43.35559 -109.4445
# 8 BHS_377 BHS_377-2017 43.35195 -109.4566
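
To see what the pieces of the ave call evaluate to, here is a small sketch that walks through the same steps for a single group of four rows. The variable names `candidates`, `first` and `keep` are introduced only for this illustration, and the seed is arbitrary.

```r
set.seed(1)                      # arbitrary seed, for repeatability only

i <- seq_len(4)                  # row index within the group: 1 2 3 4
candidates <- head(i, -1)        # possible first rows: 1 2 3
first <- sample(candidates, 1)   # randomly chosen first row
keep <- i %in% (first + 0:1)     # TRUE for the first row and the one after it

which(keep)                      # the two consecutive row indices that are kept
```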

Similar, but more concise, with data.table and its built-in row index (.I) and number of rows per group (.N):

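library(data.table)
setDT(Dat)
# Sample a within-group start position rather than a value of .I: for a
# two-row group, .I[-.N] is a single number n, and sample(n, 1) draws from 1:n.
Dat[Dat[ , .I[sample.int(.N - 1L, 1L) + 0:1], by = IndIDII]$V1]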
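
One caveat when sampling row indices such as .I per group: when the first argument of `sample()` is a single number n >= 1, R samples from `1:n` rather than returning n itself. A two-row group whose only candidate start is, say, global row 10 would therefore get a start anywhere from 1 to 10. The sketch below illustrates this with a plain vector; `x`, `draws` and `safe` are hypothetical names, and sampling a position with `sample.int()` is one common way around it.

```r
x <- 10L  # e.g. the single candidate start row of a two-row group

# sample(x, 1) treats a length-one numeric x as 1:x, so these values
# can fall anywhere in 1..10.
draws <- replicate(200, sample(x, 1))
range(draws)

# Sampling a position within x and then indexing avoids the surprise.
safe <- replicate(200, x[sample.int(length(x), 1)])
unique(safe)
```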

Answer 2: (score: 1)

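Here is a somewhat clunky tidyeval approach. It can surely be improved (what if you want more than two consecutive rows?), but it works for this application. You can also drop the row column with select() at the end of the function.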

Dat <- structure(list(IndIDII = c("BHS_265", "BHS_265", "BHS_770", "BHS_770", "BHS_770", "BHS_377", "BHS_377", "BHS_377", "BHS_377"), IndYear = c("BHS_265-2015", "BHS_265-2016", "BHS_770-2016", "BHS_770-2017", "BHS_770-2018", "BHS_377-2015", "BHS_377-2016", "BHS_377-2017", "BHS_377-2018"), WintLat = c(47.6102519805014, 47.5988417247191, 42.9737859090909, 42.9712914772727, 42.9724390816327, 43.3474354347826, 43.3555934579439, 43.3519543396226, 43.3476466990291), WintLong = c(-112.720994832869, -112.708887595506, -109.039964727273, -109.036693522727, -109.050923061224, -109.482114456522, -109.444522149533, -109.45659254717, -109.489241553398)), class = "data.frame", row.names = c(NA, -9L))

library(tidyverse)
set.seed(123)
sample_2_consecutive <- function(tbl, group_col){
  group_col <- enquo(group_col)
  with_rownums <- tbl %>%
    group_by(!!group_col) %>%
    mutate(row = row_number())
  rows_to_keep <- with_rownums %>%
    filter(row != max(row)) %>%
    sample_n(1) %>%
    mutate(row2 = row + 1) %>%
    gather(key, row, row, row2)
  with_rownums %>%
    semi_join(rows_to_keep, by = c(quo_name(quo(!!group_col)), "row")) %>%
    arrange(!!group_col, row) %>%
    ungroup() # %>%
  # select(-row)
}
sample_2_consecutive(Dat, IndIDII)
#> # A tibble: 6 x 5
#>   IndIDII IndYear      WintLat WintLong   row
#>   <chr>   <chr>          <dbl>    <dbl> <int>
#> 1 BHS_265 BHS_265-2015    47.6    -113.     1
#> 2 BHS_265 BHS_265-2016    47.6    -113.     2
#> 3 BHS_377 BHS_377-2017    43.4    -109.     3
#> 4 BHS_377 BHS_377-2018    43.3    -109.     4
#> 5 BHS_770 BHS_770-2016    43.0    -109.     1
#> 6 BHS_770 BHS_770-2017    43.0    -109.     2

Created on 2018-09-27 by the reprex package (v0.2.0).
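
On the open question of more than two consecutive rows, here is one possible sketch. It assumes a recent dplyr (1.0 or later, for the `{{ }}` embrace operator); `sample_k_consecutive` and the `toy` data are hypothetical names for illustration. Groups with fewer than `k` rows are returned whole, which is a design choice of this sketch rather than anything stated in the thread.

```r
library(dplyr)

# Hypothetical generalisation: keep k consecutive rows per group.
sample_k_consecutive <- function(tbl, group_col, k = 2) {
  tbl %>%
    group_by({{ group_col }}) %>%
    # Draw one start position per group among the n - k + 1 valid starts
    # (at least 1, so short groups are kept whole), then keep k rows from it.
    filter(row_number() %in%
             (sample.int(max(n() - k + 1L, 1L), 1L) + seq_len(k) - 1L)) %>%
    ungroup()
}

# Toy data with groups of 2, 3 and 4 rows (illustrative only).
toy <- data.frame(
  id  = rep(c("a", "b", "c"), times = c(2, 3, 4)),
  obs = c(1:2, 1:3, 1:4)
)

res <- sample_k_consecutive(toy, id, k = 3)
res
```

With `k = 2` this should behave like the function above, without the helper row column.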