Sample based on a data frame in R

Question

Assume we have two data frames, data1 and data2, both with the same columns, e.g.

 > head(data1)
  ID Region Age Label
1  1     CC  20     0
2  2     BB  20     1
3  3     AA  40     0
4  4     BB  60     1
5  5     BB  40     0
6  6     BB  40     1

Assume all features are factors (except ID).

Question: How do I take a representative sample from data2 based on data1? E.g. based on the frequency of each feature combination in data1 (see below): take 6 samples with Region:AA, Age:20, Label:0, take 1 sample with Region:AA, Age:20, Label:1, etc.

> head(count(data1, c("Region", "Age", "Label")))
  Region Age Label freq
1     AA  20     0    6
2     AA  20     1    1
3     AA  40     0    3
4     AA  40     1    5
5     AA  60     0    5
6     AA  60     1    3

I was looking at the sampling package as well as the dplyr package, but I can't get my head around it. Formally, I am looking for a way to do stratified sampling from data2 based on the distribution of features in data1.

Thank you.

Edit: First, credit goes to @Jesse Tweedle for his concise answer below using dplyr. Here an alternative partial solution using libraries sampling (function strata) and data.table is presented:

library(sampling)
library(data.table)

d1 <- data.frame(ID = 1:100, 
                 region = sample(c("AA", "BB", "CC"), 100, replace = TRUE), 
                 age = sample(c(20,40,60),100,replace = TRUE), 
                 label = sample(c(0,1), 100, replace = TRUE))
d1.table = as.data.table(d1)

d2 <- data.frame(ID = 1:1000, 
                 region = sample(c("AA", "BB", "CC"), 1000, replace = TRUE), 
                 age = sample(c(20,40,60),1000,replace = TRUE), 
                 label = sample(c(0,1), 1000, replace = TRUE))
d2.table = as.data.table(d2)

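# Sort both tables by the stratification variables: strata() expects sorted
# input and reads the sizes in the order the strata appear in the data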
setkey(d1.table, region, age)
setkey(d2.table, region, age)

d1.table.freq = d1.table[,.N,keyby = list(region, age)]

d2.sample = data.table(strata(d2.table,
                              c("region", "age"),
                              d1.table.freq$N,
                              "srswor")) # random sampling without replacement

Of course this implies that all combinations of features which appear in d1 (i.e. are not 0) have to appear in d2 and the other way around. From that point of view it is not a general solution but a partial one.
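The caveat above can be checked before calling strata. The following is only a sketch, assuming the same region and age columns as in the example; the names need, have, cover and problems are made up for illustration. It lists every stratum that exists in only one of the two tables, or where d2 holds fewer rows than d1 asks for (sampling without replacement cannot satisfy those).

```r
library(data.table)

set.seed(42)  # only to make the sketch reproducible
d1.table <- data.table(ID = 1:100,
                       region = sample(c("AA", "BB", "CC"), 100, replace = TRUE),
                       age = sample(c(20, 40, 60), 100, replace = TRUE))
d2.table <- data.table(ID = 1:1000,
                       region = sample(c("AA", "BB", "CC"), 1000, replace = TRUE),
                       age = sample(c(20, 40, 60), 1000, replace = TRUE))

# Rows requested per stratum (from d1) and rows available per stratum (in d2)
need <- d1.table[, .(needed = .N), keyby = .(region, age)]
have <- d2.table[, .(available = .N), keyby = .(region, age)]

# Full outer join, so strata present in only one table show up as well
cover <- merge(need, have, by = c("region", "age"), all = TRUE)
cover[is.na(needed), needed := 0L]
cover[is.na(available), available := 0L]

# Strata that break the assumption: missing from d1, or too few rows in d2
problems <- cover[needed == 0L | available < needed]
print(problems)
```

If problems comes back empty, the size vector d1.table.freq$N lines up with the strata of d2.table one to one and every stratum has enough rows.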

Answer 1

Partial dplyr answer

Here is some fake data, along with a counts dataset:

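library(dplyr)  # tibble(), group_by(), count(), joins, sample_n(), bind_rows()
library(purrr)  # map()

data1 <- tibble(id = 1:30,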
                region = sample(letters[1:3], 30, replace = TRUE),
                label = sample(0:1, 30, replace = TRUE))
counts <- data1 %>% group_by(region, label) %>% count()

data2 <- tibble(id = 1:300,
                region = sample(letters[1:3], 300, replace = TRUE),
                label = sample(0:1, 300, replace = TRUE))

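sample_n would normally help here, but it does not accept a separate size argument for each group. So we join counts onto data2, split the result by region and label, map sample_n over each list element with size = n (the n column comes from counts), and then use bind_rows to put the list of data frames back together: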

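data2 %>%
  # keep only the region/label groups that occur in data1 and attach their counts (n)
  inner_join(counts, by = c("region", "label")) %>%
  # one data frame per group; drop = TRUE skips combinations that have no rows
  split(list(.$region, .$label), drop = TRUE) %>%
  # n is constant within a group, so unique() yields a single sample size
  map(~ sample_n(.x, size = unique(.x$n))) %>%
  bind_rows()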

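If your datasets are very different (that is, if some group in data2 has fewer rows than the count requested from data1), you may need to use replace = TRUE in the sample_n function.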
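One way to handle this is sketched below, assuming the same fake data as above. It turns replacement on only for groups that are too small, then checks that the sample reproduces the counts from data1. The names sampled and drawn are made up for illustration, and set.seed is there only for reproducibility.

```r
library(dplyr)
library(purrr)

set.seed(1)  # only to make the sketch reproducible
data1 <- tibble(id = 1:30,
                region = sample(letters[1:3], 30, replace = TRUE),
                label = sample(0:1, 30, replace = TRUE))
data2 <- tibble(id = 1:300,
                region = sample(letters[1:3], 300, replace = TRUE),
                label = sample(0:1, 300, replace = TRUE))
counts <- count(data1, region, label)

sampled <- data2 %>%
  inner_join(counts, by = c("region", "label")) %>%
  split(list(.$region, .$label), drop = TRUE) %>%
  # sample with replacement only when the group has fewer rows than requested
  map(~ sample_n(.x, size = .x$n[1], replace = .x$n[1] > nrow(.x))) %>%
  bind_rows()

# Compare the number of rows drawn per group with the number requested
drawn <- sampled %>%
  group_by(region, label) %>%
  summarise(drawn = n(), requested = first(n), .groups = "drop")
print(drawn)
```

Groups that exist in data1 but not at all in data2 are still dropped by the join, so they would be missing from drawn rather than flagged.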
