Create a group variable when the number of rows is not a multiple of the number of groups

Time: 2018-02-03 10:57:32

Tags: r

I have a data frame with many companies (say 7) and many periods (say 2). I need to create a new column that splits the companies within each period into a few groups (say 3). Since 7 is not evenly divisible by 3, I want to assign two rows to each of the first two groups and put the one extra row in the last group, which then has three rows. In the following table, the 'res' column is the expected result:

Company     Period   res
1              1      11
2              1      11
3              1      12
4              1      12
5              1      13
6              1      13
7              1      13
1              2      21
2              2      21
3              2      22
4              2      22
5              2      23
6              2      23
7              2      23

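2 answers:

Answer 0 (score: 0)

As I understand it, you want to split into equal parts and put whatever is left over (if there is a remainder) into the last group. The following function does just that, i.e.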

f1 <- function(x, parts){
  len1 <- length(x)
  i1 <- len1 %% parts
  v1 <- rep((len1 - i1)/parts, parts)
  v1[length(v1)] <- v1[length(v1)] + i1
  v2 <- rep(seq_along(v1), v1)
  return(v2)
}

# Here are some trials:

f1(seq(7), 3)
#[1] 1 1 2 2 3 3 3
f1(seq(8), 3)
#[1] 1 1 2 2 3 3 3 3
f1(seq(9), 3)
#[1] 1 1 1 2 2 2 3 3 3
f1(seq(10), 3)
#[1] 1 1 1 2 2 2 3 3 3 3

Now you need to apply it within each group using a split-apply approach (using data.table or dplyr would surely speed this up), i.e.

do.call(rbind, 
    lapply(split(df, df$Period), function(i) {
      i$New_column <- paste0(i$Period, f1(i$Company, 3)); i}))

which gives,

     Company Period New_column
1.1        1      1         11
1.2        2      1         11
1.3        3      1         12
1.4        4      1         12
1.5        5      1         13
1.6        6      1         13
1.7        7      1         13
2.8        1      2         21
2.9        2      2         21
2.10       3      2         22
2.11       4      2         22
2.12       5      2         23
2.13       6      2         23
2.14       7      2         23
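
The `df` used above is not shown in the answer. The following is a minimal sketch that assumes `df` matches the question's table (7 companies in each of 2 periods) and applies this answer's `f1` per period with base R's `ave()` instead of `split`/`rbind`. The `ave()` variant is a swapped-in alternative, not the answer's own code; it keeps the original row names rather than producing `1.1`, `1.2`, and so on.

```r
# Assumed input, reconstructed from the question's table
df <- data.frame(Company = rep(1:7, times = 2),
                 Period  = rep(1:2, each = 7))

# f1 as defined in this answer: equal parts, remainder goes to the last part
f1 <- function(x, parts) {
  len1 <- length(x)
  i1 <- len1 %% parts
  v1 <- rep((len1 - i1) / parts, parts)
  v1[length(v1)] <- v1[length(v1)] + i1
  rep(seq_along(v1), v1)
}

# Group index within each period, then prefix it with the period
grp <- ave(df$Company, df$Period, FUN = function(x) f1(x, 3))
df$New_column <- paste0(df$Period, grp)
df
```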

NOTE: You can easily add a separator in paste0 to distinguish, for example, 1_11 from 11_1
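
A small illustration of why a separator can matter, using made-up period and group values (not taken from the question's data): without a separator, period 1 with group 11 and period 11 with group 1 produce the same label.

```r
# Hypothetical values chosen only to show the collision
period <- c(1, 11)
group  <- c(11, 1)

no_sep   <- paste0(period, group)            # both elements collapse to "111"
with_sep <- paste(period, group, sep = "_")  # "1_11" and "11_1" stay distinct

no_sep
with_sep
```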

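Answer 1 (score: 0)

Create a function of the number of companies (nc) and the number of groups (ng). For all groups except the last (that is, the first ng - 1 groups), the length of each group is the quotient (nc %/% ng). For the last group, the length is the quotient plus the remainder (nc %% ng).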

f <- function(nc, ng){
  qu <- nc %/% ng
  rep(1:ng, c(rep(qu, ng - 1), qu + nc %% ng))
}
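
A quick check of the function above on two sizes, as a sketch: with 7 companies and 3 groups the quotient is 2 and the remainder is 1, so the group lengths are 2, 2 and 3; with 10 companies they are 3, 3 and 4.

```r
# Same function as above, repeated so the snippet runs on its own
f <- function(nc, ng){
  qu <- nc %/% ng
  rep(1:ng, c(rep(qu, ng - 1), qu + nc %% ng))
}

g7  <- f(7, 3)
g10 <- f(10, 3)

g7                      # 1 1 2 2 3 3 3
as.vector(table(g10))   # group sizes: 3 3 4
```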

Do this for each period:

d$res2 <- ave(d$Period, d$Period, FUN = function(x) paste0(x, "_", f(7, 3)))    
d
#    Company Period res res2
# 1        1      1  11  1_1
# 2        2      1  11  1_1
# 3        3      1  12  1_2
# 4        4      1  12  1_2
# 5        5      1  13  1_3
# 6        6      1  13  1_3
# 7        7      1  13  1_3
# 8        1      2  21  2_1
# 9        2      2  21  2_1
# 10       3      2  22  2_2
# 11       4      2  22  2_2
# 12       5      2  23  2_3
# 13       6      2  23  2_3
# 14       7      2  23  2_3

Here the number of companies is hard-coded (7), but this can of course be computed from your data.
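
One way to avoid the hard-coded 7, as a sketch: inside `ave()`, `x` holds the rows of one period, so `length(x)` is that period's company count. The data frame `d` below is an assumption rebuilt from the question's table.

```r
# Assumed input, reconstructed from the question's table
d <- data.frame(Company = rep(1:7, times = 2),
                Period  = rep(1:2, each = 7))

f <- function(nc, ng){
  qu <- nc %/% ng
  rep(1:ng, c(rep(qu, ng - 1), qu + nc %% ng))
}

# length(x) replaces the hard-coded number of companies
d$res2 <- ave(d$Period, d$Period,
              FUN = function(x) paste0(x, "_", f(length(x), 3)))
d
```

This also copes with periods that contain different numbers of companies, since the count is taken per period.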

If the remainder does not necessarily have to be allocated to the last group, you can use cut:

ave(d$Company, d$Period, FUN = function(x) cut(seq_along(x), 3))
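
To see where `cut` puts the extra row, here is a small sketch on a single period of 7 rows. `cut` splits the range 1 to 7 into three equal-width intervals, so the extra row lands in the first group rather than the last. `labels = FALSE` is added here only to return plain integer codes.

```r
x <- 1:7

# Three equal-width intervals over 1..7: (.., 3], (3, 5], (5, ..]
g <- cut(seq_along(x), 3, labels = FALSE)

g                    # 1 1 1 2 2 3 3
as.vector(table(g))  # group sizes: 3 2 2
```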