With dplyr

Time: 2016-05-09 18:58:12

Tags: r dplyr

I have a data frame df:

df <- data.frame(group = c(rep("G1",18), rep("G2", 10)), X = c(rep("a", 10), rep("b", 8), rep("c", 4), rep("d", 6)), Y = c(rep(1:10), rep(1:8), rep(1:4), rep(1:6)))

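Possibly using dplyr/tidyr, I would like to make all subgroups (X) within each group the same length, namely the length of the smallest subgroup in that group. In short, the resulting data frame should be: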

df_r <- data.frame(group = c(rep("G1",16), rep("G2", 8)), X = c(rep("a", 8), rep("b", 8), rep("c", 4), rep("d", 4)), Y = c(rep(1:8), rep(1:8), rep(1:4), rep(1:4)))

I can't wrap my head around how to achieve this. Any help would be appreciated.

5 answers:

Answer 0 (score: 4)

Maybe this is what you want?
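
One possible dplyr sketch for the question, offered as an assumption about a workable approach and not as this answer's original code: count the rows in each (group, X) subgroup, take the minimum per group, and keep that many leading rows of every subgroup. The names min_sizes and n_min are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  group = c(rep("G1", 18), rep("G2", 10)),
  X = c(rep("a", 10), rep("b", 8), rep("c", 4), rep("d", 6)),
  Y = c(1:10, 1:8, 1:4, 1:6)
)

# Smallest subgroup size within each group (n_min is a hypothetical column name)
min_sizes <- df %>%
  count(group, X) %>%
  group_by(group) %>%
  summarise(n_min = min(n), .groups = "drop")

# Keep only the first n_min rows of every (group, X) subgroup
df_r <- df %>%
  inner_join(min_sizes, by = "group") %>%
  group_by(group, X) %>%
  filter(row_number() <= n_min) %>%
  ungroup() %>%
  select(-n_min)
```

Because the filter is based on row position and not on the values of Y, this should also work when Y is not a 1..n sequence.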

Answer 1 (score: 1)

Here is another option using data.table:

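library(data.table)
# Note: this relies on Y running 1..n within each X subgroup, as in the example data
setDT(df)[, {
        i1 <- tabulate(factor(X))            # subgroup sizes within the current group
        i2 <- sequence(pmin(i1, min(i1)))    # 1..(smallest size), once per subgroup
        .SD[Y %in% i2] } , by = .(group)]    # keep rows whose Y falls in that range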
#    group X Y
# 1:    G1 a 1
# 2:    G1 a 2
# 3:    G1 a 3
# 4:    G1 a 4
# 5:    G1 a 5
# 6:    G1 a 6
# 7:    G1 a 7
# 8:    G1 a 8
# 9:    G1 b 1
#10:    G1 b 2
#11:    G1 b 3
#12:    G1 b 4
#13:    G1 b 5
#14:    G1 b 6
#15:    G1 b 7
#16:    G1 b 8
#17:    G2 c 1
#18:    G2 c 2
#19:    G2 c 3
#20:    G2 c 4
#21:    G2 d 1
#22:    G2 d 2
#23:    G2 d 3
#24:    G2 d 4

Answer 2 (score: 1)

This is how I would do it:

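library(data.table)
# Note: setDT() and := modify df by reference, so df keeps the size column
setDT(df)[, size := .N, by = .(group, X)][         # rows in each (group, X) subgroup
          , size := min(size), by = group][         # smallest subgroup size per group
          , head(.SD, size[1]), by = .(group, X)]   # first `size` rows of each subgroup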
#    group X Y size
# 1:    G1 a 1    8
# 2:    G1 a 2    8
# 3:    G1 a 3    8
# 4:    G1 a 4    8
# 5:    G1 a 5    8
# 6:    G1 a 6    8
# 7:    G1 a 7    8
# 8:    G1 a 8    8
# 9:    G1 b 1    8
#10:    G1 b 2    8
#11:    G1 b 3    8
#12:    G1 b 4    8
#13:    G1 b 5    8
#14:    G1 b 6    8
#15:    G1 b 7    8
#16:    G1 b 8    8
#17:    G2 c 1    4
#18:    G2 c 2    4
#19:    G2 c 3    4
#20:    G2 c 4    4
#21:    G2 d 1    4
#22:    G2 d 2    4
#23:    G2 d 3    4
#24:    G2 d 4    4
#    group X Y size

Answer 3 (score: 0)

Here is a fairly ugly base R answer:

# get the minimum subgroup size for each group
minCntGroup <- aggregate(Y~group, data=aggregate(Y~group+X, data=df, FUN=length), FUN=min)

# sample row indices of df for each X subgroup, returned as a list,
# using minCntGroup to sample the correct size
# (as.character() so this also works when X is a character column,
#  which is the data.frame() default from R 4.0 on)
set.seed(1234)
mySampleVector <- unlist(sapply(unique(as.character(df$X)), function(i)
                         sample(which(df$X == i),
                         size=minCntGroup[minCntGroup$group %in% df[df$X==i,"group"], "Y"])))

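sapply returns a list that holds, for each X subgroup, the indices of the sampled rows, with the same number of rows for every subgroup inside a given group. I wrap this list in unlist to get a vector.

If you want to turn this into a data.frame, you can use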

df_r <- df[mySampleVector,]
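
Because sample() picks rows at random, the rows selected this way will generally differ from the df_r shown in the question. If the first rows of each subgroup should be kept instead, a minimal base R sketch using ave() could look like the following. The names idx, sub_size, min_size, pos, and df_first are illustrative and not part of the answer above.

```r
df <- data.frame(
  group = c(rep("G1", 18), rep("G2", 10)),
  X = c(rep("a", 10), rep("b", 8), rep("c", 4), rep("d", 6)),
  Y = c(1:10, 1:8, 1:4, 1:6)
)

idx <- seq_len(nrow(df))

# Number of rows in each (group, X) subgroup, repeated for every row
sub_size <- ave(idx, df$group, df$X, FUN = length)

# Smallest subgroup size within each group, repeated for every row
min_size <- ave(sub_size, df$group, FUN = min)

# Position of each row within its (group, X) subgroup
pos <- ave(idx, df$group, df$X, FUN = seq_along)

# Keep the leading rows of each subgroup, up to the group's minimum size
df_first <- df[pos <= min_size, ]
```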

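Answer 4 (score: 0)

Following a comment on one of the answers, here is a solution that works when the variable is not consecutive and that generalizes to other data:

library(dplyr)

out <- df %>% 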
  group_by(group, X) %>% 
  mutate(subgroup_size = n()) %>% 
  group_by(group) %>% 
  mutate(min_subgroup_size = min(subgroup_size)) %>% 
  group_by(group, X) %>% 
  filter(row_number() <= min_subgroup_size) %>% 
  dplyr::select(-c(subgroup_size, min_subgroup_size)) %>%
  ungroup()

table(out$group, out$X)
#      a b c d
#   G1 8 8 0 0
#   G2 0 0 4 4

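This solution uses 3 grouping steps to get the requested result:

  1. The first grouping (by group and X) determines the subgroup sizes.
  2. The next grouping moves one level up, to group, to get the minimum subgroup size across all subgroups in that group.
  3. Finally, grouping by (group and X) again, the previously determined minimum subgroup size is used to filter the appropriate number of rows from each subgroup.

(Optional) Replace filter(row_number() <= min_subgroup_size) with sample_n(min_subgroup_size[1]) to select rows at random within each subgroup.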
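
The random variant can also be sketched with slice() and sample.int(), which avoids depending on how sample_n() handles a column-valued size. This is an illustrative sketch, not the answer's own code: out_random is a made-up name, and set.seed() is only there to make the draw reproducible.

```r
library(dplyr)

df <- data.frame(
  group = c(rep("G1", 18), rep("G2", 10)),
  X = c(rep("a", 10), rep("b", 8), rep("c", 4), rep("d", 6)),
  Y = c(1:10, 1:8, 1:4, 1:6)
)

set.seed(1234)
out_random <- df %>%
  group_by(group, X) %>%
  mutate(subgroup_size = n()) %>%
  group_by(group) %>%
  mutate(min_subgroup_size = min(subgroup_size)) %>%
  group_by(group, X) %>%
  # draw min_subgroup_size random row positions within each subgroup
  slice(sample.int(n(), first(min_subgroup_size))) %>%
  ungroup() %>%
  select(-c(subgroup_size, min_subgroup_size))

table(out_random$group, out_random$X)
```

The subgroup counts are the same as in the deterministic version; only which rows are kept within the larger subgroups changes from seed to seed.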