Joining the first n factor columns (with a different n per row) in R

Date: 2017-04-10 02:45:04

Tags: r

A data frame contains ID, group, n (numeric), and several factor variables:

ID <- c(1,2,3,4,5,6,7,8,9,10)
group <- c("m", "m", "m", "f", "f", "m", "m", "f", "f", "m")
n <- c(1,2,6,3,6,8,4,1,4,2)
b1 <- c("a", "b", "", "a", "d", "d", "a", "c", "c", "b")
b2 <- c("a", "", "e", "a", "d", "d", "a", "c", "c", "b")
b3 <- c("a", "b", "", "a", "", "d", "a", "c", "c", "b")
b4 <- c("a", "b", "e", "a", "", "d", "a", "c", "c", "b")
b5 <- c("a", "b", "e", "a", "d", "", "", "", "c", "b")
b6 <- c("a", "", "", "", "d", "d", "", "c", "c", "b")
df <- data.frame(ID, group, n, b1, b2, b3, b4, b5, b6)

I need to create a new character column (call it y).

y is computed by joining the first n of the b variables (b1, b2, b3, b4, b5, b6), separated by commas.

If a value is blank, it is left out of the join.

For example, for ID = 1, y = "a"; for ID = 2, y = "b" (not "b,"); for ID = 3, y = "e,e,e"; and so on.

The faster the code, the better.

3 answers:

Answer 0 (score: 2)

A possible solution, although speed might still be an issue:

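df$y <- sapply(seq_len(nrow(df)), function(i){
    # take the first n[i] values from columns b1..b6 of row i
    cvec <- head(unlist(df[i, 4:9]), df$n[i])
    # drop the blank entries
    cvec <- cvec[cvec != '']
    # join what is left with commas
    paste(cvec, collapse = ',')
})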
#    ID group n b1 b2 b3 b4 b5 b6         y
# 1   1     m 1  a  a  a  a  a  a         a
# 2   2     m 2  b     b  b  b            b
# 3   3     m 6     e     e  e        e,e,e
# 4   4     f 3  a  a  a  a  a        a,a,a
# 5   5     f 6  d  d        d  d   d,d,d,d
# 6   6     m 8  d  d  d  d     d d,d,d,d,d
# 7   7     m 4  a  a  a  a         a,a,a,a
# 8   8     f 1  c  c  c  c     c         c
# 9   9     f 4  c  c  c  c  c  c   c,c,c,c
# 10 10     m 2  b  b  b  b  b  b       b,b
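
Since speed is a concern, one possible variation, offered only as a sketch, converts the b columns to a character matrix once and builds a logical mask instead of indexing the data frame row by row. The names `b` and `keep` are illustrative and not part of the answer above, the sample data is rebuilt with `stringsAsFactors = FALSE`, and nothing was timed here, so any speed gain on real data is an assumption to verify.

```r
# Sample data from the question, with strings kept as character.
df <- data.frame(
  ID    = 1:10,
  group = c("m", "m", "m", "f", "f", "m", "m", "f", "f", "m"),
  n     = c(1, 2, 6, 3, 6, 8, 4, 1, 4, 2),
  b1 = c("a", "b", "", "a", "d", "d", "a", "c", "c", "b"),
  b2 = c("a", "", "e", "a", "d", "d", "a", "c", "c", "b"),
  b3 = c("a", "b", "", "a", "", "d", "a", "c", "c", "b"),
  b4 = c("a", "b", "e", "a", "", "d", "a", "c", "c", "b"),
  b5 = c("a", "b", "e", "a", "d", "", "", "", "c", "b"),
  b6 = c("a", "", "", "", "d", "d", "", "c", "c", "b"),
  stringsAsFactors = FALSE
)

# Convert the b columns to a character matrix once.
b <- as.matrix(df[paste0("b", 1:6)])

# keep[i, j] is TRUE when column j is among the first n[i] columns
# of row i and the value there is not blank.
keep <- col(b) <= df$n & b != ""

# Join the kept values of each row with commas.
df$y <- vapply(
  seq_len(nrow(b)),
  function(i) paste(b[i, keep[i, ]], collapse = ","),
  character(1)
)

print(df$y)
```

The comparison `col(b) <= df$n` relies on R recycling `df$n` down the rows of the matrix, which lines up because `df$n` has one entry per row.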

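Answer 1 (score: 0)

Here is an option using gsub and paste. We paste the 'b' columns together (do.call(paste0, df[-(1:3)])), then use substring to keep only as many characters as the 'n' column indicates, and use gsub to insert a , between each character. Note that the blanks are dropped before counting, so this keeps the first n non-blank characters rather than the first n columns: ID 2 yields "b,b" where the question asks for "b". It also assumes every value is a single character.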

df$y <- gsub("(?<=\\S)(?=\\S)", ",",
           substring(do.call(paste0, df[-(1:3)]), 1, df$n), perl = TRUE)

df
#   ID group n b1 b2 b3 b4 b5 b6         y
#1   1     m 1  a  a  a  a  a  a         a
#2   2     m 2  b     b  b  b          b,b
#3   3     m 6     e     e  e        e,e,e
#4   4     f 3  a  a  a  a  a        a,a,a
#5   5     f 6  d  d        d  d   d,d,d,d
#6   6     m 8  d  d  d  d     d d,d,d,d,d
#7   7     m 4  a  a  a  a         a,a,a,a
#8   8     f 1  c  c  c  c     c         c
#9   9     f 4  c  c  c  c  c  c   c,c,c,c
#10 10     m 2  b  b  b  b  b  b       b,b
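
The intermediate values behind the "b,b" shown for ID 2 can be traced on that single row. This is only an illustrative sketch: `b_vals`, `pasted`, and `kept` are names made up here, and `paste0(..., collapse = "")` on one vector stands in for the column-wise `do.call(paste0, ...)` used on the whole data frame.

```r
# Values of b1..b6 and n for ID 2, copied from the question's data.
b_vals <- c("b", "", "b", "b", "b", "")
n <- 2

# Step 1: glue the values together; the blanks vanish at this point.
pasted <- paste0(b_vals, collapse = "")

# Step 2: keep the first n characters of the pasted string.
kept <- substring(pasted, 1, n)

# Step 3: insert a comma at every position that sits between
# two non-space characters.
y <- gsub("(?<=\\S)(?=\\S)", ",", kept, perl = TRUE)

print(c(pasted = pasted, kept = kept, y = y))
```

Because the blank in b2 is already gone after step 1, the second character kept in step 2 comes from b3.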

Answer 2 (score: 0)

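df$y <- apply(df, 1, function(r) {
  # apply() passes each row as character, so convert n before calling head();
  # blanks collapse into runs of spaces, which gsub() turns into single commas
  gsub("\\s+", ",", trimws(paste(head(r[4:9], as.integer(r["n"])), collapse = " ")))})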
df


#    ID group n b1 b2 b3 b4 b5 b6         y
# 1   1     m 1  a  a  a  a  a  a         a
# 2   2     m 2  b     b  b  b            b
# 3   3     m 6     e     e  e        e,e,e
# 4   4     f 3  a  a  a  a  a        a,a,a
# 5   5     f 6  d  d        d  d   d,d,d,d
# 6   6     m 8  d  d  d  d     d d,d,d,d,d
# 7   7     m 4  a  a  a  a         a,a,a,a
# 8   8     f 1  c  c  c  c     c         c
# 9   9     f 4  c  c  c  c  c  c   c,c,c,c
# 10 10     m 2  b  b  b  b  b  b       b,b
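
To see how the space-joining trick removes the blanks, the steps can be traced for ID 3 alone. This is a minimal sketch with illustrative names (`b_vals`, `joined`, `trimmed`) that are not part of the answer above.

```r
# Values of b1..b6 and n for ID 3, copied from the question's data.
b_vals <- c("", "e", "", "e", "e", "")
n <- 6

# Step 1: join the first n values with single spaces; each blank
# value leaves two separators next to each other.
joined <- paste(head(b_vals, n), collapse = " ")

# Step 2: strip the leading and trailing whitespace left by blank
# values at either end.
trimmed <- trimws(joined)

# Step 3: replace every run of whitespace with one comma.
y <- gsub("\\s+", ",", trimmed)

print(y)
```

This relies on the values themselves containing no whitespace; a value such as "a b" would be split into two items.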