I have a data frame similar to the following:
library(tidyverse)
set.seed(4214)
df <- data.frame(value = sample(x = 1:50, 70, replace = TRUE),
                 group = sample(x = letters, 70, replace = TRUE),
                 stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  arrange(group)
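where group is my grouping variable, and each value occurs with a different frequency (for example, group == "a" occurs 5 times, group == "b" occurs 6 times, and so on).
I need to split this data as evenly as possible into n = 9 sub data frames. The catch is that I cannot split the same value of the grouping variable across subsets. For example, group == "b" must not appear in both subset 1 and subset 2.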
n <- 9
df %>%
  mutate(divider = rep(x = 1:n,
                       each = ceiling(nrow(.)/n),
                       length.out = nrow(.))) %>%
  split(.$divider)
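Here I create a divider column that I hoped would split the data into subsets. However, a given value of group can end up with two different values of divider, so this approach splits the grouping variable across subsets. I have been trying to improve on this with nest and lag, but without success so far.
I know the subsets will not have equal numbers of rows, but I would like something like the following: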
$`1`
# A tibble: 11 x 3
value group divider
<int> <chr> <int>
1 43 a 1
2 22 a 1
3 1 a 1
4 5 a 1
5 4 a 1
6 18 b 1
7 32 b 1
8 33 b 1
9 47 b 1
10 43 b 1
11 35 b 1
$`2`
# A tibble: 6 x 3
value group divider
<int> <chr> <int>
1 24 c 2
2 3 d 2
3 12 d 2
4 13 e 2
5 6 e 2
6 45 f 2
$`3`
...
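To make the problem concrete, the following sketch flags every group whose rows receive more than one divider value. It rebuilds the example data as above; the names n_subsets, with_divider and split_groups are introduced here for illustration only, and the exact groups reported depend on the R version's random number generator.

```r
library(dplyr)

# Rebuild the example data from the question.
set.seed(4214)
df <- data.frame(value = sample(x = 1:50, 70, replace = TRUE),
                 group = sample(x = letters, 70, replace = TRUE),
                 stringsAsFactors = FALSE) %>%
  arrange(group)

n_subsets <- 9  # illustrative name for the n used in the question

# Same naive divider as in the question: fixed-size blocks of rows.
with_divider <- df %>%
  mutate(divider = rep(seq_len(n_subsets),
                       each = ceiling(nrow(df) / n_subsets),
                       length.out = nrow(df)))

# A group is split across subsets if its rows carry more than one divider.
split_groups <- with_divider %>%
  group_by(group) %>%
  summarise(n_dividers = n_distinct(divider)) %>%
  filter(n_dividers > 1)

split_groups
```

Any row in split_groups is a group that violates the constraint; a valid solution should leave this table empty.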
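Answer 0 (score: 1)
One way to do this, although the result depends on the order of the data, is to count the rows in each group, accumulate those counts, and cut the running total into chunks of roughly equal size.
Take the cumulative sum of the group frequencies, divide it by 9, round up, and use the resulting integer as the new splitting variable. Note that dividing by 9 targets about 9 rows per subset rather than exactly 9 subsets, which is why the output below has 8 subsets for 70 rows.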
dftab <- as.data.frame(table(df$group)) %>%
  mutate(nobs = cumsum(Freq),
         newgrouping = ceiling(nobs/9)) %>%
  group_by(newgrouping) %>%
  summarise(number_obs = sum(Freq))
dftab
# A tibble: 8 x 2
newgrouping number_obs
<dbl> <int>
1 1 5
2 2 12
3 3 9
4 4 10
5 5 9
6 6 7
7 7 11
8 8 7
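The table above only summarises subset sizes. A minimal sketch of how the labels might be attached back to the rows and used to split the data is shown below. It is an assumption-laden variant rather than the answer's exact code: it scales the cumulative count by n_subsets / nrow(df) so that the labels run up to 9 at most, and the names n_subsets, lookup and subsets are illustrative.

```r
library(dplyr)

# Rebuild the example data from the question.
set.seed(4214)
df <- data.frame(value = sample(x = 1:50, 70, replace = TRUE),
                 group = sample(x = letters, 70, replace = TRUE),
                 stringsAsFactors = FALSE) %>%
  arrange(group)

n_subsets <- 9  # assumed number of subsets wanted

# One row per group: its size, the running total, and a subset label.
# Multiplying before dividing keeps the last label at exactly n_subsets.
lookup <- df %>%
  count(group, name = "Freq") %>%
  mutate(nobs = cumsum(Freq),
         newgrouping = ceiling(nobs * n_subsets / nrow(df)))

# Attach the label to every row, then split. Each group has exactly one
# label, so no group can end up in two subsets.
subsets <- df %>%
  left_join(select(lookup, group, newgrouping), by = "group") %>%
  split(.$newgrouping)

sapply(subsets, nrow)
```

Because labels are assigned per group, some labels may be skipped, so the list can contain fewer than n_subsets elements.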
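For the "as evenly as possible" part, we can run a crude optimisation on the standard deviation of the number of observations per subset. Because the result depends on the order of the group variable, reshuffling that order on every iteration helps the search.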
set.seed(4214)
df <- data.frame(value = sample(x = 1:50, 70, replace = TRUE),
                 group = sample(x = letters, 70, replace = TRUE),
                 stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  arrange(group)
store_group <- list()
store_sd <- NA_integer_
for(i in 1:1000){
  dftab <- table(df$group) %>%
    as.data.frame() %>%
    # important step is to shuffle the group variable every iteration
    mutate(group = factor(Var1, levels = df$group %>%
                            unique %>%
                            sample)) %>%
    arrange(group) %>%
    mutate(nobs = cumsum(Freq),
           newgrouping = ceiling(nobs/9)) %>%
    select(newgrouping, group, Freq)
  store_group[[i]] <- dftab
  df_sd <- dftab %>%
    group_by(newgrouping) %>%
    summarise(number_obs = sum(Freq))
  store_sd[i] <- sd(df_sd$number_obs)
}
This results in:
store_group[[which.min(store_sd)]] %>%
group_by(newgrouping) %>%
summarise(number_obs = sum(Freq))
newgrouping number_obs
<dbl> <int>
1 1 9
2 2 9
3 3 9
4 4 8
5 5 9
6 6 9
7 7 8
8 8 9
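where store_group[[which.min(store_sd)]] holds the group frequency table with the "best" grouping found (given the number of iterations in the loop). Because every value of group receives exactly one newgrouping label, no group is shared between subsets when you split the dataset by the newgrouping variable.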
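The loop above stops at a lookup table, so here is a compact sketch of the same random search that also splits the original rows. It is a re-implementation in base R plus one join, not the answer's code: label_groups is a hypothetical helper, best_lookup and subsets are illustrative names, and the divisor 9 is kept from the answer (about 9 rows per subset).

```r
library(dplyr)

# Rebuild the example data from the question.
set.seed(4214)
df <- data.frame(value = sample(x = 1:50, 70, replace = TRUE),
                 group = sample(x = letters, 70, replace = TRUE),
                 stringsAsFactors = FALSE) %>%
  arrange(group)

# Hypothetical helper: label groups, in the order given, with a subset
# number so that each subset holds roughly `size` rows.
label_groups <- function(freq, size) {
  ceiling(cumsum(freq) / size)
}

freq <- table(df$group)
best_sd <- Inf
best_lookup <- NULL

for (i in 1:1000) {
  # Shuffle the order of the groups, then label them in that order.
  shuffled <- freq[sample(length(freq))]
  labels <- label_groups(shuffled, 9)
  sizes <- tapply(as.vector(shuffled), labels, sum)
  current_sd <- if (length(sizes) > 1) sd(sizes) else 0
  # Keep the most even assignment seen so far.
  if (current_sd < best_sd) {
    best_sd <- current_sd
    best_lookup <- data.frame(group = names(shuffled),
                              newgrouping = as.vector(labels),
                              stringsAsFactors = FALSE)
  }
}

# Attach the best labels to the rows and split on them.
subsets <- df %>%
  left_join(best_lookup, by = "group") %>%
  split(.$newgrouping)

sapply(subsets, nrow)
```

Tracking only the best assignment avoids storing all 1000 tables; the search is still random, so more iterations may or may not find a more even split.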
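Answer 1 (score: 1)
Assuming you want a solution that follows alphabetical order, as in your expected output, you can divide the cumsum of the group table by the target number of rows per subset (nrow(dat) / 9 for 9 splits) and round the result. Rounding to the nearest integer, rather than always rounding up or down, lets the cut points move in both directions and distributes the groups more evenly. This yields a vector x that assigns a split indicator to each category of the group variable. Splitting x by itself then gives a list of group names per subset, which can be used with lapply to split the data frame.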
x <- round(cumsum(table(dat$group)) / (nrow(dat) / 9))
result <- lapply(lapply(split(x, x), names), function(i) dat[dat$group %in% i, ])
Distribution of rows across the resulting list:
t(Map(nrow, result))
# 1 2 3 4 5 6 7 8 9
# [1,] 11 6 9 8 7 7 8 7 7
sapply(result, "[", 2)
$`1.group`
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "b"
$`2.group`
[1] "c" "d" "d" "e" "e" "f"
$`3.group`
[1] "g" "g" "g" "g" "i" "j" "j" "j" "j"
$`4.group`
[1] "k" "k" "l" "l" "l" "l" "l" "l"
$`5.group`
[1] "n" "n" "o" "p" "p" "p" "p"
$`6.group`
[1] "q" "q" "q" "q" "r" "r" "r"
$`7.group`
[1] "s" "s" "s" "t" "u" "u" "u" "v"
$`8.group`
[1] "w" "w" "w" "x" "x" "x" "x"
$`9.group`
[1] "y" "y" "y" "y" "z" "z" "z"
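As a possible shortcut, the nested lapply can be replaced by indexing x with the group column and calling split once. The sketch below is self-contained: it rebuilds a stand-in for dat from the group sizes in this answer's data, with placeholder values in the value column (an assumption, since only the group sizes matter here).

```r
# Group sizes as they appear in the `dat` object of this answer.
sizes <- c(a = 5, b = 6, c = 1, d = 2, e = 2, f = 1, g = 4, i = 1, j = 4,
           k = 2, l = 6, n = 2, o = 1, p = 4, q = 4, r = 3, s = 3, t = 1,
           u = 3, v = 1, w = 3, x = 4, y = 4, z = 3)

# Stand-in data frame; `value` holds placeholder numbers, not the real data.
dat <- data.frame(value = seq_len(sum(sizes)),
                  group = rep(names(sizes), times = sizes),
                  stringsAsFactors = FALSE)

# Same split indicator as above: one rounded label per group.
x <- round(cumsum(table(dat$group)) / (nrow(dat) / 9))

# Look up each row's label by its group name, then split in one step.
result <- split(dat, x[dat$group])

sapply(result, nrow)
```

With these group sizes the row counts should match the distribution shown above (11, 6, 9, 8, 7, 7, 8, 7, 7).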
Data
dat <- structure(list(value = c(43L, 22L, 1L, 5L, 4L, 18L, 32L, 33L,
47L, 43L, 35L, 24L, 3L, 12L, 13L, 6L, 45L, 12L, 5L, 22L, 47L,
35L, 20L, 36L, 34L, 15L, 22L, 9L, 41L, 1L, 7L, 2L, 21L, 3L, 8L,
33L, 12L, 39L, 19L, 2L, 34L, 45L, 7L, 22L, 24L, 25L, 20L, 19L,
45L, 36L, 25L, 23L, 47L, 13L, 45L, 36L, 23L, 14L, 12L, 15L, 12L,
11L, 25L, 31L, 41L, 14L, 38L, 15L, 13L, 6L), group = c("a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "c", "d", "d", "e",
"e", "f", "g", "g", "g", "g", "i", "j", "j", "j", "j", "k", "k",
"l", "l", "l", "l", "l", "l", "n", "n", "o", "p", "p", "p", "p",
"q", "q", "q", "q", "r", "r", "r", "s", "s", "s", "t", "u", "u",
"u", "v", "w", "w", "w", "x", "x", "x", "x", "y", "y", "y", "y",
"z", "z", "z")), row.names = c(6L, 21L, 50L, 66L, 69L, 15L, 36L,
46L, 48L, 62L, 67L, 34L, 18L, 54L, 31L, 51L, 3L, 7L, 9L, 24L,
39L, 55L, 8L, 11L, 27L, 29L, 59L, 70L, 19L, 23L, 40L, 45L, 52L,
68L, 26L, 43L, 44L, 16L, 38L, 63L, 65L, 10L, 49L, 56L, 61L, 1L,
13L, 64L, 22L, 35L, 47L, 4L, 25L, 33L, 53L, 37L, 14L, 17L, 60L,
2L, 5L, 12L, 57L, 28L, 32L, 41L, 42L, 20L, 30L, 58L), class = "data.frame")