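I have a data.frame object made up of columns of tree-like information. For example, I have run a search for a set of features (query_name), and it returned a set of potential matches (match_name). Each match has an associated location, which is broken down into continent, country, region, and town.
The problem I am trying to solve is, for a given query_name, to find the location information that all of its potential matches have in common.
For example, with the following sample data: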
query_name <- c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4))
match_name <- paste0("match", 1:9)
continent <- c(rep("NorthAmerica", 3), rep("NorthAmerica", 2), rep("Europe", 4))
country <- c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4))
region <- c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2))
town <- c("Manhattan", "Albany", "Buffalo", "Toronto", NA, "Munich", "Nuremberg", "Berlin", "Frankfurt")
data <- data.frame(query_name, match_name, continent, country, region, town)
This produces the following data.frame object:
  query_name match_name    continent      country  region      town
1   feature1     match1 NorthAmerica UnitedStates NewYork Manhattan
2   feature1     match2 NorthAmerica UnitedStates NewYork    Albany
3   feature1     match3 NorthAmerica UnitedStates NewYork   Buffalo
4   feature2     match4 NorthAmerica       Canada Ontario   Toronto
5   feature2     match5 NorthAmerica       Canada    <NA>      <NA>
6   feature3     match6       Europe      Germany  Bayern    Munich
7   feature3     match7       Europe      Germany  Bayern Nuremberg
8   feature3     match8       Europe      Germany  Berlin    Berlin
9   feature3     match9       Europe      Germany  Berlin Frankfurt
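I would appreciate advice on how to build a function that produces the result below. Note that the shared location information is now concatenated, with ; as the separator.
feature1: the matches differ only in their town information, so the returned string contains the continent through region information.
feature2: one match has no region or town, so only continent and country are kept.
feature3: the matches share continent and country information but have different region and town values, so only continent and country are kept.
I am hoping for output that looks like this: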
query_name location_output
feature1   NorthAmerica;UnitedStates;NewYork;
feature2   NorthAmerica;Canada;;
feature3   Europe;Germany;;
Thanks for any advice you can spare. Cheers!
Answer 0 (score: 1)
Here is one option:
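library(tidyverse)
data %>%
    # reshape to long form: one row per (query, match, location level)
    gather(key, val, -query_name, -match_name) %>%
    select(-match_name, -key) %>%
    # count how often each value occurs within a query
    group_by(query_name, val) %>%
    add_count() %>%
    group_by(query_name) %>%
    # keep the values that occur most often within each query
    filter(n == max(n)) %>%
    # join the distinct non-missing values with ";"
    summarise(location_output = paste0(unique(val[!is.na(val)]), collapse = ";"))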
## A tibble: 3 x 2
#  query_name location_output
#  <fct>      <chr>
#1 feature1   NorthAmerica;UnitedStates;NewYork
#2 feature2   NorthAmerica;Canada
#3 feature3   Europe;Germany
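Note that the result above omits the trailing separators shown in the question's desired output. If the fixed four-field layout matters, the strings can be padded afterwards. Below is a minimal base R sketch, assuming four location levels and input strings that do not already end in a separator. The name pad_fields is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: pad ";"-joined strings so that each has n_fields fields.
# Assumes the input strings carry no trailing ";" (strsplit() drops trailing
# empty fields, so already-padded strings would be counted incorrectly).
pad_fields <- function(x, n_fields = 4) {
  # number of fields already present (an empty string counts as one empty field)
  n_present <- pmax(lengths(strsplit(x, ";", fixed = TRUE)), 1)
  # append one ";" for every missing field
  paste0(x, strrep(";", pmax(n_fields - n_present, 0)))
}

padded <- pad_fields(c("NorthAmerica;UnitedStates;NewYork",
                       "NorthAmerica;Canada",
                       "Europe;Germany"))
print(padded)
```

For the three strings above this should give NorthAmerica;UnitedStates;NewYork; then NorthAmerica;Canada;; and Europe;Germany;; which matches the layout requested in the question.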
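Answer 1 (score: 0)
This is less elegant than @MauritsEvers's solution (it does not automatically handle an arbitrary number of levels), but it guarantees that every location_output contains all four fields, joined by three ; separators.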
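library(dplyr)
data %>%
    group_by(query_name) %>%
    # keep a level only if all matches agree on it, otherwise leave it empty;
    # as.character() guards against factor columns, which ifelse() would
    # otherwise reduce to their integer codes
    summarize(continent = ifelse(n_distinct(continent) == 1, as.character(first(continent)), ""),
              country = ifelse(n_distinct(country) == 1, as.character(first(country)), ""),
              region = ifelse(n_distinct(region) == 1, as.character(first(region)), ""),
              town = ifelse(n_distinct(town) == 1, as.character(first(town)), "")) %>%
    # join the four fields; empty fields still contribute their separator
    mutate(location_output = paste(continent, country, region, town, sep = ";")) %>%
    select(query_name, location_output)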
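The per-column check can also be written so that it handles any number of levels. Below is a minimal base R sketch, not a drop-in replacement for the dplyr code. It assumes the level columns are listed from broadest to narrowest, that an NA counts as a disagreement, and that everything below the first disagreeing level is left empty. The names shared_location and level_cols are hypothetical.

```r
# Example data from the question (strings kept as character, not factor)
data <- data.frame(
  query_name = c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4)),
  match_name = paste0("match", 1:9),
  continent  = c(rep("NorthAmerica", 5), rep("Europe", 4)),
  country    = c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4)),
  region     = c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2)),
  town       = c("Manhattan", "Albany", "Buffalo", "Toronto", NA,
                 "Munich", "Nuremberg", "Berlin", "Frankfurt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one output string from the rows of a single query.
shared_location <- function(rows, level_cols) {
  out <- rep("", length(level_cols))
  for (i in seq_along(level_cols)) {
    vals <- unique(rows[[level_cols[i]]])
    # stop at the first level that is missing or not identical across matches
    if (length(vals) != 1 || is.na(vals)) break
    out[i] <- vals
  }
  # empty fields still contribute their separator
  paste(out, collapse = ";")
}

level_cols <- c("continent", "country", "region", "town")
location_output <- vapply(split(data, data$query_name), shared_location,
                          character(1), level_cols = level_cols)
result <- data.frame(query_name = names(location_output),
                     location_output = unname(location_output),
                     stringsAsFactors = FALSE)
print(result)
```

Because the loop stops at the first level where the matches disagree, a narrower level is never reported without the broader levels above it. Adding a level only requires extending level_cols.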
Answer 2 (score: 0)
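lapply(split(data, data$query_name), function(x){
    x = x[,-(1:2)]  # drop query_name and match_name, keep the location columns
    # number of distinct values per column, run-length encoded from continent to town
    r = rle(sapply(x, function(d) length(unique(d))))
    # the first run covers the leading columns that share a single value
    # (this assumes continent itself is shared, i.e. r$values[1] == 1)
    x[1, seq(r$lengths[1])]
})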
#$feature1
#     continent      country  region
#1 NorthAmerica UnitedStates NewYork
#
#$feature2
#     continent country
#4 NorthAmerica  Canada
#
#$feature3
#  continent country
#6    Europe Germany
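The rle() call relies on the first run consisting of columns with exactly one distinct value. If the matches already disagree at the continent level, the first run instead covers columns with several values, and those columns would be returned as if they were shared. Below is a hedged sketch of a guard for that case. The name n_shared_leading is hypothetical, and the two small data frames are made up for illustration only.

```r
# Hypothetical helper: number of leading columns whose values are identical
# across all rows; returns 0 when the very first column already differs.
n_shared_leading <- function(x) {
  n_unique <- vapply(x, function(d) length(unique(d)), integer(1))
  r <- rle(unname(n_unique))
  if (r$values[1] == 1L) r$lengths[1] else 0L
}

# Illustrative data: matches agree on continent and country only
agree <- data.frame(continent = c("Europe", "Europe"),
                    country   = c("Germany", "Germany"),
                    region    = c("Bayern", "Berlin"),
                    stringsAsFactors = FALSE)

# Illustrative data: matches disagree from the top level down
differ <- data.frame(continent = c("Europe", "Asia"),
                     country   = c("Germany", "Japan"),
                     stringsAsFactors = FALSE)

print(n_shared_leading(agree))
print(n_shared_leading(differ))
```

When using this count for column indexing, seq_len(n) is the safer choice, because seq(0) evaluates to c(1, 0) and would still select the first column.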