R: Collapse data from multiple columns into one

Asked: 2017-12-12 00:08:11

Tags: r dplyr aggregate collapse

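I know there are many questions on this topic, so I apologize if this is a duplicate. I am trying to collapse multiple columns of a data set into a single column.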

Suppose this is the structure of the data set I am working with:

df <- data.frame(
      cbind(
      variable_1 = c('Var1', NA, NA,'Var1'),
      variable_2 = c('Var2', 'No', NA, NA),
      variable_3 = c(NA, NA, 'Var3', NA),
      variable_4 = c(NA, 'Var4', NA, NA),
      variable_5 = c(NA, 'No', 'Var5', NA),
      variable_6 = c(NA, NA, 'Var6', NA)

    ))

 variable_1  variable_2  variable_3  variable_4  variable_5  variable_6 
 Var1        Var2        NA          NA          NA          NA
 NA          No          NA          Var4        No          NA
 NA          NA          Var3        NA          Var5        Var6
 Var1        NA          NA          NA          NA          NA

What I expect is a new column, variable_7, like this (NA and "No" entries are left out):

 variable_1  variable_2  variable_3  variable_4  variable_5  variable_6  variable_7
 Var1        Var2        NA          NA          NA          NA          Var1, Var2
 NA          No          NA          Var4        No          NA          Var4
 NA          NA          Var3        NA          Var5        Var6        Var3, Var5, Var6
 Var1        NA          NA          NA          NA          NA          Var1

Any help with achieving this would be greatly appreciated.

4 answers:

Answer 0 (score: 4)

df$variable_7 <- apply(df, 1, function(x) paste(x[!is.na(x) & x != "No"], collapse = ", "))
df
#  variable_1 variable_2 variable_3 variable_4 variable_5 variable_6
#1       Var1       Var2       <NA>       <NA>       <NA>       <NA>
#2       <NA>         No       <NA>       Var4         No       <NA>
#3       <NA>       <NA>       Var3       <NA>       Var5       Var6
#4       Var1       <NA>       <NA>       <NA>       <NA>       <NA>
#        variable_7
#1       Var1, Var2
#2             Var4
#3 Var3, Var5, Var6
#4             Var1

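Explanation: use apply together with paste(..., collapse = ", ") to concatenate all entries in each row (except NA and "No"), and store the result in the new column variable_7.

Sample data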

df <- data.frame(
      cbind(
      variable_1 = c('Var1', NA, NA,'Var1'),
      variable_2 = c('Var2', 'No', NA, NA),
      variable_3 = c(NA, NA, 'Var3', NA),
      variable_4 = c(NA, 'Var4', NA, NA),
      variable_5 = c(NA, 'No', 'Var5', NA),
      variable_6 = c(NA, NA, 'Var6', NA)

    ))
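One caveat with the apply call above: it reads every column of df, so running it a second time would also pick up variable_7 itself, and a row with nothing to keep ends up as an empty string. A minimal sketch that limits the call to the six source columns and returns NA for empty rows might look like the following. The helper collapse_row and the vector src_cols are hypothetical names introduced here for illustration, not part of the original answer, and the data frame is rebuilt without cbind so the columns are plain character vectors.

```r
# Sample data from the question, built without cbind() so columns are character.
df <- data.frame(
  variable_1 = c("Var1", NA, NA, "Var1"),
  variable_2 = c("Var2", "No", NA, NA),
  variable_3 = c(NA, NA, "Var3", NA),
  variable_4 = c(NA, "Var4", NA, NA),
  variable_5 = c(NA, "No", "Var5", NA),
  variable_6 = c(NA, NA, "Var6", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption, not from the original answer):
# keep the entries that are neither NA nor listed in `drop`, then join them.
collapse_row <- function(x, drop = "No", sep = ", ") {
  kept <- x[!is.na(x) & !(x %in% drop)]
  if (length(kept) == 0) NA_character_ else paste(kept, collapse = sep)
}

# Only the source columns are read, so the call can be repeated safely.
src_cols <- paste0("variable_", 1:6)
df$variable_7 <- apply(df[src_cols], 1, collapse_row)
df$variable_7 <- apply(df[src_cols], 1, collapse_row)  # second run gives the same result
```

Whether an empty row should become NA or "" depends on how the column is used later; NA is chosen here only because it is easier to filter on.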

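Answer 1 (score: 2)

I assume that, given n rows, the objective is to create a vector of length n in which each element is a comma-separated string of the values in that row that contain the characters "Var". (If you intend to use some other criterion to separate wanted from unwanted values, change the grep accordingly.)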

apply(df, 1, function(x) toString(grep("Var", x, value = TRUE)))
## [1] "Var1, Var2"       "Var4"             "Var3, Var5, Var6" "Var1"         
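To illustrate what changing the criterion means, the sketch below compares the grep approach (a whitelist: keep only what matches a pattern) with the NA/"No" filter from the other answer (a blacklist: drop known unwanted values). The vector x and the stricter pattern "^Var[0-9]+$" are made up for this example and are not part of the question's data.

```r
# Hypothetical row of values; "Yes" does not occur in the question's data.
x <- c("Var1", NA, "No", "Yes", "Var10")

# Whitelist: keep only values that look like "Var" followed by digits.
# grep() never matches NA, so missing values are dropped automatically.
whitelisted <- toString(grep("^Var[0-9]+$", x, value = TRUE))
whitelisted
## [1] "Var1, Var10"

# Blacklist: drop NA and "No", keep everything else (including "Yes").
blacklisted <- toString(x[!is.na(x) & x != "No"])
blacklisted
## [1] "Var1, Yes, Var10"
```

On the question's data both criteria give the same result; they only differ once other values such as "Yes" appear.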

Answer 2 (score: 1)

Using a data.table "reshape" approach rather than a loop or apply:

library(data.table)
setDT(df)

df[, id := .I][
    melt(df, id.vars = "id")[grepl("Var", value), .(variable_7 = paste0(value, collapse = ",")), by = .(id)]
    , on = "id"
    , nomatch = 0
    ][order(id)]


#    variable_1 variable_2 variable_3 variable_4 variable_5 variable_6 id     variable_7
# 1:       Var1       Var2         NA         NA         NA         NA  1      Var1,Var2
# 2:         NA         No         NA       Var4         No         NA  2           Var4
# 3:         NA         NA       Var3         NA       Var5       Var6  3 Var3,Var5,Var6
# 4:       Var1         NA         NA         NA         NA         NA  4           Var1
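Note that the join above returns only the rows that have at least one matching value, and it leaves the helper column id in the result. A possible variant, sketched here under the assumption that every original row should be kept, uses an update join to write variable_7 back by reference. The names dt, long and collapsed are introduced for this example, and the separator ", " is chosen to match the output requested in the question.

```r
library(data.table)

# Sample data from the question as a data.table with character columns.
dt <- data.table(
  variable_1 = c("Var1", NA, NA, "Var1"),
  variable_2 = c("Var2", "No", NA, NA),
  variable_3 = c(NA, NA, "Var3", NA),
  variable_4 = c(NA, "Var4", NA, NA),
  variable_5 = c(NA, "No", "Var5", NA),
  variable_6 = c(NA, NA, "Var6", NA)
)

dt[, id := .I]                     # row identifier used for reshaping and joining
long <- melt(dt, id.vars = "id")   # one row per (id, column) pair

# Keep the values containing "Var" and collapse them per original row.
collapsed <- long[grepl("Var", value),
                  .(variable_7 = paste(value, collapse = ", ")),
                  by = id]

# Update join: rows of dt without a match simply get NA in variable_7.
dt[collapsed, variable_7 := i.variable_7, on = "id"]
dt[, id := NULL]                   # drop the helper column again
```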

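Answer 3 (score: 1)

A solution using dplyr and tidyr. df4 is the final output. Please see the Data section at the end of this answer for how I created the data frame df: cbind is not necessary, and it is a good idea to add stringsAsFactors = FALSE to prevent factor columns from being created.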

library(dplyr)
library(tidyr)

df2 <- df %>% mutate(ID = 1:n()) 

df3 <- df2 %>%
  gather(Variable, Value, -ID, na.rm = TRUE) %>%
  filter(!Value %in% "No") %>%
  group_by(ID) %>%
  summarise(variable_7 = toString(Value))

df4 <- df2 %>% 
  left_join(df3, by = "ID") %>%
  select(-ID) 

df4
#   variable_1 variable_2 variable_3 variable_4 variable_5 variable_6       variable_7
# 1       Var1       Var2       <NA>       <NA>       <NA>       <NA>       Var1, Var2
# 2       <NA>         No       <NA>       Var4         No       <NA>             Var4
# 3       <NA>       <NA>       Var3       <NA>       Var5       Var6 Var3, Var5, Var6
# 4       Var1       <NA>       <NA>       <NA>       <NA>       <NA>             Var1

Data

df <- data.frame(
    variable_1 = c('Var1', NA, NA,'Var1'),
    variable_2 = c('Var2', 'No', NA, NA),
    variable_3 = c(NA, NA, 'Var3', NA),
    variable_4 = c(NA, 'Var4', NA, NA),
    variable_5 = c(NA, 'No', 'Var5', NA),
    variable_6 = c(NA, NA, 'Var6', NA),
    stringsAsFactors = FALSE
  )
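In current tidyr releases gather() is superseded by pivot_longer(). The same pipeline could be written roughly as follows. This is a sketch that assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument); it is not part of the original answer, and the intermediate name collapsed is introduced here.

```r
library(dplyr)
library(tidyr)

# Same data as above, with character columns.
df <- data.frame(
  variable_1 = c("Var1", NA, NA, "Var1"),
  variable_2 = c("Var2", "No", NA, NA),
  variable_3 = c(NA, NA, "Var3", NA),
  variable_4 = c(NA, "Var4", NA, NA),
  variable_5 = c(NA, "No", "Var5", NA),
  variable_6 = c(NA, NA, "Var6", NA),
  stringsAsFactors = FALSE
)

# Reshape to long form, drop NA and "No", then collapse the values per row ID.
collapsed <- df %>%
  mutate(ID = row_number()) %>%
  pivot_longer(-ID, names_to = "Variable", values_to = "Value",
               values_drop_na = TRUE) %>%
  filter(Value != "No") %>%
  group_by(ID) %>%
  summarise(variable_7 = toString(Value), .groups = "drop")

# Join back so that every original row is kept, then remove the helper ID.
df4 <- df %>%
  mutate(ID = row_number()) %>%
  left_join(collapsed, by = "ID") %>%
  select(-ID)
```

Because of the left join, a row with no remaining values would get NA in variable_7 rather than being dropped.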