Question

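I am trying to use a loop inside a function to reduce the size of a data set. I want to sample equally from each of the four subgroups in my data frame (all of equal length). I am having trouble writing code that samples n-1 rows from each subgroup, where n is the subgroup's current length. My current code is: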

sub.df<- function(x){
  library(data.table)
  library(tidyverse)
  setDT(x)
  while(nrow(x) > 24) { 
    x.1 <- x %>% # this is the beginning of the sample part
      group_by(x$spiral) %>% 
      tally() %>% select(-n) %>%
      sample_n(x, nrow(x)-1, replace = FALSE) #this is where I have trouble
    ks <- ks.test(dist(x[,c(1,2)]), unif.null) #this part is for evaluating the exclusions
    ks.1 <- ks.test(dist(x.1[,c(1,2)]), unif.null)
    if(ks.1$statistic > ks$statistic) {x <- x.1} else {x <- x}
  }

}

An example of the data:

x.cord  y.cord  subgroup
1       1       1
1       4       1
3       5       1
2       1       1
2       -3      2
3       -1      2
3       -2      2
1       -3      2
-2      -2      3
-4      -1      3
-5      -5      3
-2      -1      3
-3      4       4
-1      1       4
-2      5       4
-4      3       4

Now, if the loop ran correctly, the first iteration would sample 3 (4-1) rows from each subgroup, then 2 (3-1), then 1 (2-1). So my final data would be:

x.cord   y.cord   subgroup
3        5        1
1        -3       2
-5       -5       3
-4       3        4

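According to the code I provided, my actual data set would end up with 24 points, 6 in each subgroup, but this should hopefully illustrate what I am trying to do.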

Answer 1

At a high level, I know I want to use group_by() and filter()

group_by(x, subgroup) %>% filter(predicate_n_minus_1(subgroup))

So the challenge is to write and test predicate_n_minus_1(). I came up with

# TRUE for a random n - 1 of the n positions (all FALSE when n <= 1)
predicate_n_minus_1 <- function(x)
    seq_along(x) %in% sample.int(length(x), max(length(x) - 1L, 0L))

This is easy to test, including the important edge cases of zero- and one-length subgroups

library(testthat)
expect_equal(predicate_n_minus_1(integer()), logical())        # length 0
expect_equal(predicate_n_minus_1(integer(1)), FALSE)           # length 1
expect_equal(length(predicate_n_minus_1(integer(5))), 5)       # length isomorphism
expect_equal(sum(predicate_n_minus_1(integer(5))), 4)          # n - 1
expect_equal(sum(predicate_n_minus_1(letters)), length(letters) - 1) # other types!
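
As a usage sketch, the predicate can be applied repeatedly so that every subgroup shrinks by one row per pass, which is the 4, 3, 2, 1 progression described in the question. The data frame `df` and the stopping size `target` are illustrative assumptions, and the predicate is restated here (with `sample.int()`, so that any row can be the one dropped) only to keep the block self-contained.

```r
library(dplyr)

# Keep a random n - 1 of n elements (assumed variant of the predicate above).
predicate_n_minus_1 <- function(x)
    seq_along(x) %in% sample.int(length(x), max(length(x) - 1L, 0L))

# Example data from the question: 4 subgroups of 4 rows each.
df <- data.frame(
    x.cord   = c(1, 1, 3, 2,  2, 3, 3, 1,  -2, -4, -5, -2,  -3, -1, -2, -4),
    y.cord   = c(1, 4, 5, 1,  -3, -1, -2, -3,  -2, -1, -5, -1,  4, 1, 5, 3),
    subgroup = rep(1:4, each = 4)
)

target <- 1L  # hypothetical stopping size per subgroup

# Drop one random row per subgroup on each pass until the target is reached.
while (min(count(df, subgroup)$n) > target) {
    df <- df %>%
        group_by(subgroup) %>%
        filter(predicate_n_minus_1(subgroup)) %>%
        ungroup()
}

count(df, subgroup)  # one row left in each of the four subgroups
```

In the asker's function, the body of this loop would be the place to compare the KS statistics before accepting the smaller data set.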

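I know this is not a pure tidyverse solution, but it seems cleaner, easier to test, and easier to modify than the nested function calls in MKR's answer. Perhaps there is a more holistic solution that similarly separates the overall data manipulation from the filter specification?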

Answer 2

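It seems to me that you are not using sample_n correctly. The function group_size can help you find the size of each group. Assuming all groups have the same size, you can replace the select statement in your function as shown below.

First, let's demonstrate how this sub-sampling works. Once verified, the OP can use it as part of the function.

Using min(group_size(group_by(., subgroup))) - 1 ensures that the number of rows sampled from each group is one fewer than the size of the smallest group.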

library(tidyverse)
x %>% # this is the beginning of the sample part
  group_by(subgroup) %>%  # This will ensure that equal selection from each group
  sample_n(.,min(group_size(group_by(.,subgroup)))-1, replace = FALSE)

#Result - 3 from each subgroup has been selected. 

# # A tibble: 12 x 3
# # Groups: subgroup [4]
# x.cord y.cord subgroup
# <int>  <int>    <int>
# 1      1      1        1
# 2      3      5        1
# 3      2      1        1
# 4      2     -3        2
# 5      3     -1        2
# 6      1     -3        2
# 7     -4     -1        3
# 8     -2     -1        3
# 9     -5     -5        3
# 10     -4      3        4
# 11     -2      5        4
# 12     -3      4        4
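
`sample_n()` has since been superseded in dplyr by `slice_sample()`. A minimal sketch of the same per-group draw, assuming dplyr 1.0.0 or later; the data frame `x` is rebuilt from the question's example so the block runs on its own, and `n_keep` and `x_sub` are illustrative names.

```r
library(dplyr)

# Example data from the question: 4 subgroups of 4 rows each.
x <- data.frame(
    x.cord   = c(1, 1, 3, 2,  2, 3, 3, 1,  -2, -4, -5, -2,  -3, -1, -2, -4),
    y.cord   = c(1, 4, 5, 1,  -3, -1, -2, -3,  -2, -1, -5, -1,  4, 1, 5, 3),
    subgroup = rep(1:4, each = 4)
)

# One less than the size of the smallest subgroup, computed up front.
n_keep <- min(count(x, subgroup)$n) - 1L

x_sub <- x %>%
    group_by(subgroup) %>%
    slice_sample(n = n_keep) %>%  # draws n_keep rows within each group
    ungroup()

count(x_sub, subgroup)  # 3 rows per subgroup
```

Computing `n_keep` before the pipeline keeps the sample size a plain constant, which avoids relying on how `.` is resolved inside nested calls.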

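Now that the approach above has been verified, modify the function accordingly.

Note: the function has not been tested. The OP is asked to test it with the actual data.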

# modified function should be as
sub.df<- function(x){
  library(tidyverse)
  while(nrow(x) > 24) { 
    x.1 <- x %>% # this is the beginning of the sample part
      group_by(spiral) %>% 
      sample_n(.,min(group_size(group_by(.,spiral)))-1, replace = FALSE)
    ks <- ks.test(dist(x[,c(1,2)]), unif.null) #this part is for evaluating the exclusions
    ks.1 <- ks.test(dist(x.1[,c(1,2)]), unif.null)
    if(ks.1$statistic > ks$statistic) {x <- x.1} else {x <- x}
  }
  x
}

Data:

x <- read.table(text = "x.cord y.cord subgroup
1 1 1
1 4 1
3 5 1
2 1 1
2 -3 2
3 -1 2
3 -2 2
1 -3 2
-2 -2 3
-4 -1 3
-5 -5 3
-2 -1 3
-3 4 4
-1 1 4
-2 5 4
-4 3 4", header = TRUE)
