I have a data frame describing company ownership levels, as shown below:
Company Subsidiary1 Subsidiary2 Subsidiary3
DE5930 DE5931 NA NA
GB3489 GB3490 NA NA
GB3489 GB3490 GB3491 NA
US2036 US2037 NA NA
US2036 US2037 US2038 NA
US2036 US2037 US2038 GB3491
....# and so on
Now I want to create a single column that lists all subsidiaries of each company, like this:
Company Subsidiaries
DE5930 DE5931
GB3489 GB3490
GB3489 GB3491
US2036 US2037
US2036 US2038
US2036 GB3491
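The dataset is quite large (over 100,000 rows), and I could not find a solution using the group_by or aggregate functions, because most examples deal with numeric variables (e.g., computing a mean).
One idea was to remove duplicates with df[ !duplicated(df$Subsidiary1), ] to keep the first occurrence of each subsidiary and then shift the values to the left. The problem is that a subsidiary can belong to more than one company (e.g., "GB3491"), and I do not want to drop those observations. Is there a good way to solve this?
Thanks in advance!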
Answer 0: (score: 0)
I would suggest the following tidyverse approach:
library(tidyverse)
#Data
df <- structure(list(Company = c("DE5930", "GB3489", "GB3489", "US2036",
"US2036", "US2036"), Subsidiary1 = c("DE5931", "GB3490", "GB3490",
"US2037", "US2037", "US2037"), Subsidiary2 = c(NA, NA, "GB3491",
NA, "US2038", "US2038"), Subsidiary3 = c(NA, NA, NA, NA, NA,
"GB3491")), class = "data.frame", row.names = c(NA, -6L))
Code:
df %>% pivot_longer(cols = -Company) %>% select(-name) %>%
filter(!is.na(value)) %>%
filter(!duplicated(paste(Company,value)))
Output:
# A tibble: 6 x 2
Company value
<chr> <chr>
1 DE5930 DE5931
2 GB3489 GB3490
3 GB3489 GB3491
4 US2036 US2037
5 US2036 US2038
6 US2036 GB3491
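For readers without the tidyverse, the same reshape-then-deduplicate idea can be sketched in base R. This is only a minimal sketch under the assumption that the data looks like the sample in the question; the name `long` is introduced here for illustration and is not part of the original answer.

```r
# Sample data copied from the question
df <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

# Stack all subsidiary columns into one vector (column by column) and
# repeat the Company column once per subsidiary column so rows stay aligned
long <- data.frame(
  Company      = rep(df$Company, times = ncol(df) - 1),
  Subsidiaries = unlist(df[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop empty slots, then drop repeated Company/Subsidiary pairs
long <- unique(long[!is.na(long$Subsidiaries), ])

# Group the rows by company; order() leaves ties in their existing order
long <- long[order(long$Company), ]
rownames(long) <- NULL
long
```

Because duplicates are removed on the Company/Subsidiary pair rather than on the subsidiary alone, "GB3491" is kept for both GB3489 and US2036.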
Answer 1: (score: 0)
We can use coalesce:
library(dplyr)
df1 %>%
transmute(Company, Subsidiaries =
coalesce(!!! rlang::syms(rev(names(df1)[-1]))))
# Company Subsidiaries
#1 DE5930 DE5931
#2 GB3489 GB3490
#3 GB3489 GB3491
#4 US2036 US2037
#5 US2036 US2038
#6 US2036 GB3491
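To make the coalesce step less opaque: reversing the subsidiary columns and coalescing them picks, for each row, the deepest non-missing subsidiary. The following is a rough base R sketch of that idea, not the dplyr implementation itself; `first_non_na` is a hypothetical helper named here for illustration.

```r
# Sample data copied from the answer
df1 <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep x where it is present, otherwise fall back to y
first_non_na <- function(x, y) ifelse(is.na(x), y, x)

# Fold over the columns from Subsidiary3 down to Subsidiary1, so the
# deepest level that is filled in wins for each row
deepest <- Reduce(first_non_na, rev(df1[-1]))

result <- data.frame(Company = df1$Company, Subsidiaries = deepest,
                     stringsAsFactors = FALSE)
result
```

Note that this works row by row and returns exactly one subsidiary per input row. It reproduces the desired output for the sample data because each row adds one new subsidiary at its deepest level; data that does not follow that pattern would need the reshape-and-deduplicate approach instead.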
Or, in base R, using max.col:
cbind(df1[1], Subsidiaries = df1[-1][cbind(seq_len(nrow(df1)),
max.col(!is.na(df1[-1]), "last"))])
Data:
df1 <- structure(list(Company = c("DE5930", "GB3489", "GB3489", "US2036",
"US2036", "US2036"), Subsidiary1 = c("DE5931", "GB3490", "GB3490",
"US2037", "US2037", "US2037"), Subsidiary2 = c(NA, NA, "GB3491",
NA, "US2038", "US2038"), Subsidiary3 = c(NA, NA, NA, NA, NA,
"GB3491")), class = "data.frame", row.names = c(NA, -6L))
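The max.col one-liner above can be unpacked into separate steps, which may make it easier to follow. This is just an illustrative sketch; the intermediate names `subs`, `present`, `last_col`, and `idx` are assumptions introduced here, not part of the original answer.

```r
# Sample data, same as above
df1 <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

subs <- df1[-1]

# Logical matrix: TRUE where a subsidiary is present
present <- !is.na(subs)

# For each row, the position of the last TRUE (ties are broken by "last")
last_col <- max.col(present, ties.method = "last")

# Two-column (row, column) index matrix used to pull one cell per row
idx <- cbind(seq_len(nrow(subs)), last_col)

result <- data.frame(Company = df1$Company,
                     Subsidiaries = as.matrix(subs)[idx],
                     stringsAsFactors = FALSE)
result
```

Since everything here is vectorised over the whole matrix, it should scale reasonably to a data frame with more than 100,000 rows, though that has not been measured here.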