Assume we have two data frames data1 and data2, both with same columns, e.g.
> head(data1)
ID Region Age Label
1 1 CC 20 0
2 2 BB 20 1
3 3 AA 40 0
4 4 BB 60 1
5 5 BB 40 0
6 6 BB 40 1
Assume all features are factors (except ID).
Question: How do I take a representative sample from data2 based on data1? E.g. based on product frequency of data1, see below: take 6 sample with Region:AA, Age:20, Label:0, take 1 sample with Region:AA, Age:20, Label:1, etc...
> head(count(data1, c("Region", "Age", "Label")))
Region Age Label freq
1 AA 20 0 6
2 AA 20 1 1
3 AA 40 0 3
4 AA 40 1 5
5 AA 60 0 5
6 AA 60 1 3
I was looking at the sampling
package as well as dplyr
package. But I can't get my head around it. Formally, I am looking for a way for stratified sampling from data2 based on distribution of features in data1.
Thank you.
Edit: First, credit goes to @Jesse Tweedle for his concise answer below using dplyr
. Here an alternative partial solution using libraries sampling
(function strata
) and data.table
is presented:
library(sampling)
library(data.table)
d1 <- data.frame(ID = 1:100,
region = sample(c("AA", "BB", "CC"), 100, replace = TRUE),
age = sample(c(20,40,60),100,replace = TRUE),
label = sample(c(0,1), 100, replace = TRUE))
d1.table = as.data.table(d1)
d2 <- data.frame(ID = 1:1000,
region = sample(c("AA", "BB", "CC"), 1000, replace = TRUE),
age = sample(c(20,40,60),1000,replace = TRUE),
label = sample(c(0,1), 1000, replace = TRUE))
d2.table = as.data.table(d2)
#Sort
setkey(d1.table, region, age)
setkey(d2.table, region, age)
d1.table.freq = d1.table[,.N,keyby = list(region, age)]
d2.sample = data.table(strata(d2.table,
c("region", "age"),
d1.table.freq$N,
"srswor")) # random sampling without replacement
Of course this implies that all combinations of features which appear in d1 (i.e. are not 0) have to appear in d2 and the other way around. From that point of view it is not a general solution but a partial one.
答案 0 :(得分:1)
这里有一些假数据,带有counts
数据集:
data1 <- tibble(id = 1:30,
region = sample(letters[1:3], 30, replace = TRUE),
label = sample(0:1, 30, replace = TRUE))
counts <- data1 %>% group_by(region, label) %>% count()
data2 <- tibble(id = 1:300,
region = sample(letters[1:3], 300, replace = TRUE),
label = sample(0:1, 300, replace = TRUE))
sample_n
通常会对此有所帮助,但不会为每个组分别提出size
个参数。因此,我们使用counts
(split
将region, label
数据集map
加入sample_n
个变量,size = n
n
到每个列表}}来自count),然后使用bind_rows
将数据帧列表重新组合在一起:
data2 %>%
left_join(counts) %>%
split(list(data2$region, data2$label)) %>%
map(~ sample_n(.x, size = unique(.x$n))) %>%
bind_rows()
如果您的数据集差别很大,则可能需要在replace = TRUE
函数中使用sample_n
。