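How can I summarize/list the values of multiple columns by group?

Asked: 2020-08-21 18:02:43

Tags: r dataframe grouping

I have a data frame describing the ownership levels of companies, which looks like this: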

Company   Subsidiary1    Subsidiary2    Subsidiary3
DE5930      DE5931           NA             NA
GB3489      GB3490           NA             NA
GB3489      GB3490         GB3491           NA
US2036      US2037           NA             NA
US2036      US2037         US2038           NA
US2036      US2037         US2038         GB3491
....# and so on

Now I would like to create a single column that lists all subsidiaries of each company, like this:

Company   Subsidiaries
DE5930     DE5931          
GB3489     GB3490
GB3489     GB3491
US2036     US2037
US2036     US2038       
US2036     GB3491

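The dataset is really large (more than 100,000 rows), and I could not find a solution using group_by or aggregate, because most examples deal with numeric variables (e.g., the mean).

One idea was to remove duplicates with df[!duplicated(df$Subsidiary1), ] to keep only the first occurrence of each subsidiary and then shift the values to the left. The problem is that a subsidiary can belong to more than one company (e.g. "GB3491"), and I don't want to drop those observations. Is there a good way to solve this?

Thanks in advance!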

2 answers:

Answer 0: (score: 0)

I would suggest the following tidyverse approach:

library(tidyverse)
#Data
df <- structure(list(Company = c("DE5930", "GB3489", "GB3489", "US2036", 
"US2036", "US2036"), Subsidiary1 = c("DE5931", "GB3490", "GB3490", 
"US2037", "US2037", "US2037"), Subsidiary2 = c(NA, NA, "GB3491", 
NA, "US2038", "US2038"), Subsidiary3 = c(NA, NA, NA, NA, NA, 
"GB3491")), class = "data.frame", row.names = c(NA, -6L))

Code:

# Reshape to long format, dropping missing subsidiaries, then keep
# each Company/subsidiary pair once
df %>% pivot_longer(cols = -Company, values_drop_na = TRUE) %>%
  select(-name) %>%
  distinct()

Output:

# A tibble: 6 x 2
  Company value 
  <chr>   <chr> 
1 DE5930  DE5931
2 GB3489  GB3490
3 GB3489  GB3491
4 US2036  US2037
5 US2036  US2038
6 US2036  GB3491
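
For readers who prefer not to load the tidyverse, the same idea (stack the subsidiary columns, drop the missing values, keep each Company/subsidiary pair once) can be sketched in base R. This is only an illustration under the sample data above; the names long and pairs are made up for the example and do not come from the answer.

```r
# Sample data from the question
df <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

# Stack all subsidiary columns into one vector (column by column),
# repeating the Company column once per subsidiary column
long <- data.frame(
  Company      = rep(df$Company, times = ncol(df) - 1),
  Subsidiaries = unlist(df[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)

long  <- long[!is.na(long$Subsidiaries), ]  # drop empty ownership levels
pairs <- unique(long)                       # one row per Company/subsidiary pair
pairs <- pairs[order(pairs$Company), ]      # bring each company's rows together
rownames(pairs) <- NULL
pairs
```

Because duplicates are removed on the pair rather than on the subsidiary alone, a subsidiary such as GB3491 that belongs to two companies is kept once for each of them.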

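Answer 1: (score: 0)

We can use coalesce on the subsidiary columns in reverse order, which returns the last non-NA subsidiary in each row: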

library(dplyr)
df1 %>%
    transmute(Company, Subsidiaries = 
        coalesce(!!! rlang::syms(rev(names(df1)[-1]))))
#  Company Subsidiaries
#1  DE5930       DE5931
#2  GB3489       GB3490
#3  GB3489       GB3491
#4  US2036       US2037
#5  US2036       US2038
#6  US2036       GB3491
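
The `!!! rlang::syms(rev(...))` part can look opaque, so here is a small sketch of what it expands to. It assumes the three subsidiary columns of the sample data; the object names explicit and spliced are only illustrative.

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

# Written out by hand: columns are listed deepest level first, so the
# first non-NA value coalesce() finds is the last subsidiary filled in
explicit <- df1 %>%
  transmute(Company,
            Subsidiaries = coalesce(Subsidiary3, Subsidiary2, Subsidiary1))

# Same call, with the reversed column names turned into symbols and
# spliced in as separate arguments
spliced <- df1 %>%
  transmute(Company,
            Subsidiaries = coalesce(!!!rlang::syms(rev(names(df1)[-1]))))

identical(explicit$Subsidiaries, spliced$Subsidiaries)
```

The spliced form has the advantage of not hard-coding the number of subsidiary columns. Note that this approach returns exactly one subsidiary per input row, so it relies on every ownership level also appearing on a row of its own, as it does in the sample data.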

Or use max.col in base R:

cbind(df1[1], Subsidiaries =  df1[-1][cbind(seq_len(nrow(df1)), 
         max.col(!is.na(df1[-1]), "last"))])
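
To make the one-liner easier to follow, the same steps can be split up as below. This is a sketch on the sample data; subs, filled, last_col and result are names introduced here for illustration.

```r
# Sample data from the question
df1 <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

subs   <- as.matrix(df1[-1])   # character matrix of the subsidiary columns
filled <- !is.na(subs)         # TRUE where a subsidiary is present

# For each row, the column position of the last TRUE
last_col <- max.col(filled, ties.method = "last")
last_col
# [1] 1 1 2 1 2 3

# Index the matrix with (row, column) pairs to pull out one value per row
result <- data.frame(
  Company      = df1$Company,
  Subsidiaries = subs[cbind(seq_len(nrow(subs)), last_col)],
  stringsAsFactors = FALSE
)
result
```

A row with no subsidiary at all has only FALSE values, so with ties.method = "last" it points at the final column and the result for that row is NA.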

Data

df1 <- structure(list(Company = c("DE5930", "GB3489", "GB3489", "US2036", 
"US2036", "US2036"), Subsidiary1 = c("DE5931", "GB3490", "GB3490", 
"US2037", "US2037", "US2037"), Subsidiary2 = c(NA, NA, "GB3491", 
NA, "US2038", "US2038"), Subsidiary3 = c(NA, NA, NA, NA, NA, 
"GB3491")), class = "data.frame", row.names = c(NA, -6L))