I want to do a group_by or an aggregate. I have something like this:
> head(affiliation_clean)
Affiliation_ID Affiliation_Name City Country
1 000001 New Mexico State University Las Cruces Las Cruces United States
2 000001 New Mexico State University Las Cruces Las Cruces <NA>
3 000001 New Mexico State University Las Cruces <NA> <NA>
4 000002 Palo Alto Research Center Incorporated Palo Alto <NA>
5 000002 Palo Alto Research Center Incorporated <NA> United States
6 000002 Palo Alto Research Center Incorporated <NA> <NA>
Grouping by "Affiliation_ID" and taking the longest "Affiliation_Name", "City" and "Country" string in each group, I would like to get:
> head(affiliation_clean)
Affiliation_ID Affiliation_Name City Country
1 000001 New Mexico State University Las Cruces Las Cruces United States
2 000002 Palo Alto Research Center Incorporated Palo Alto United States
Thanks in advance.
Answer 0: (score: 1)
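Assuming each 'Affiliation_ID', 'Affiliation_Name' pair has a single unique 'City' and 'Country', group by the first two columns and then use summarise_all to take the unique non-NA element of every remaining column: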
library(dplyr)
affiliation_clean %>%
  group_by(Affiliation_ID, Affiliation_Name) %>%
  # funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
  summarise_all(~ unique(.[!is.na(.)]))
# A tibble: 2 x 4
# Groups: Affiliation_ID [?]
# Affiliation_ID Affiliation_Name City Country
# <chr> <chr> <chr> <chr>
#1 000001 New Mexico State University Las Cruces Las Cruces United States
#2 000002 Palo Alto Research Center Incorporated Palo Alto United States
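The uniqueness assumption above can be checked before summarising. Below is a minimal base R sketch, not part of the original answer: it rebuilds the question's sample data by hand, and the helper name n_distinct_non_na is made up for illustration.

```r
# Sample data rebuilt from the question (IDs kept as character)
affiliation_clean <- data.frame(
  Affiliation_ID   = c("000001", "000001", "000001", "000002", "000002", "000002"),
  Affiliation_Name = c(rep("New Mexico State University Las Cruces", 3),
                       rep("Palo Alto Research Center Incorporated", 3)),
  City             = c("Las Cruces", "Las Cruces", NA, "Palo Alto", NA, NA),
  Country          = c("United States", NA, NA, NA, "United States", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct non-NA values in a vector
n_distinct_non_na <- function(x) length(unique(x[!is.na(x)]))

# Count distinct City/Country values per (Affiliation_ID, Affiliation_Name) pair
counts <- aggregate(
  affiliation_clean[c("City", "Country")],
  by  = affiliation_clean[c("Affiliation_ID", "Affiliation_Name")],
  FUN = n_distinct_non_na
)

# TRUE when no group has more than one distinct City or Country
assumption_holds <- all(counts$City <= 1 & counts$Country <= 1)
assumption_holds
```

If assumption_holds is FALSE, at least one group has conflicting values, and the unique() approach would return more than one value for that group.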
Answer 1: (score: 1)
Based on your description, here is a dplyr solution that selects the longest string for each Affiliation_ID and column.
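library(dplyr)
dat2 <- dat %>%
  group_by(Affiliation_ID) %>%
  # which.max() skips NA lengths; the trailing [1] yields NA when a group is all NA
  # (funs() is deprecated since dplyr 0.8.0, so a formula lambda is used instead)
  summarise_all(~ .[which.max(nchar(.))][1])
dat2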
# # A tibble: 2 x 4
# Affiliation_ID Affiliation_Name City Country
# <int> <chr> <chr> <chr>
# 1 1 New Mexico State University Las Cruces Las Cruces United States
# 2 2 Palo Alto Research Center Incorporated Palo Alto United States
Data
dat <- read.table(text = " Affiliation_ID Affiliation_Name City Country
1 '000001' 'New Mexico State University Las Cruces' 'Las Cruces' 'United States'
2 '000001' 'New Mexico State University Las Cruces' 'Las Cruces' NA
3 '000001' 'New Mexico State University Las Cruces' NA NA
4 '000002' 'Palo Alto Research Center Incorporated' 'Palo Alto' NA
5 '000002' 'Palo Alto Research Center Incorporated' NA 'United States'
6 '000002' 'Palo Alto Research Center Incorporated' NA NA",
header = TRUE, stringsAsFactors = FALSE)
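For readers without dplyr, the same "longest string per group" idea can be sketched in base R with aggregate(). This is an illustrative alternative rather than the answer's own method. The helper name longest_string is hypothetical, and the data frame is rebuilt by hand with character IDs so that the leading zeros are kept.

```r
# Sample data rebuilt from the question (IDs kept as character)
dat <- data.frame(
  Affiliation_ID   = c("000001", "000001", "000001", "000002", "000002", "000002"),
  Affiliation_Name = c(rep("New Mexico State University Las Cruces", 3),
                       rep("Palo Alto Research Center Incorporated", 3)),
  City             = c("Las Cruces", "Las Cruces", NA, "Palo Alto", NA, NA),
  Country          = c("United States", NA, NA, NA, "United States", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: longest non-NA string, or NA if every value is missing
longest_string <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  x[which.max(nchar(x))]
}

# One row per Affiliation_ID, applying the helper to each remaining column
res <- aggregate(
  dat[c("Affiliation_Name", "City", "Country")],
  by  = dat["Affiliation_ID"],
  FUN = longest_string
)
res
```

aggregate() on a data frame passes NA values through to FUN, so the helper drops them itself; when several strings tie for the longest, which.max() keeps the first one.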