Finding shared column information: a lowest common ancestor problem

Time: 2019-03-07 04:04:55

Tags: r dataframe

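I have a data.frame object made up of columns of tree-like information. For example, I have run a search for a set of features (query_name) and got back a set of potential matches (match_name). Each match has an associated location, which is broken down into continent, country, region, and town.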

The problem I am trying to solve is to find, for a given query_name, the location information that all of its potential matches have in common.

For example, using the following sample data:

query_name <- c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4))
match_name <- paste0("match", 1:9)
continent <- c(rep("NorthAmerica", 3), rep("NorthAmerica", 2), rep("Europe", 4))
country <- c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4))
region <- c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2))
town <- c("Manhattan", "Albany", "Buffalo", "Toronto", NA, "Munich", "Nuremberg", "Berlin", "Frankfurt")

data <- data.frame(query_name, match_name, continent, country, region, town)

This produces the following data.frame object:

    query_name match_name    continent      country  region      town
1   feature1     match1 NorthAmerica UnitedStates NewYork Manhattan
2   feature1     match2 NorthAmerica UnitedStates NewYork    Albany
3   feature1     match3 NorthAmerica UnitedStates NewYork   Buffalo
4   feature2     match4 NorthAmerica       Canada Ontario   Toronto
5   feature2     match5 NorthAmerica       Canada    <NA>      <NA>
6   feature3     match6       Europe      Germany  Bayern    Munich
7   feature3     match7       Europe      Germany  Bayern Nuremberg
8   feature3     match8       Europe      Germany  Berlin    Berlin
9   feature3     match9       Europe      Germany  Berlin Frankfurt

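I would appreciate advice on how to build a function that produces the result below. Note that the shared location information is now concatenated into one string, with fields separated by a ; delimiter.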

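  • feature1 differs only in town, so the returned string contains the continent, country, and region information.
  • feature2 differs in region and town, because one of the two matches contains no information at those levels. Missing information is treated the same as a differing value, so the only things the feature2 matches have in common are continent and country.
  • feature3 shares continent and country, but its matches differ in region and town, so only continent and country are kept.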

I am hoping for output that looks like this:

query_name   location_output
feature1    NorthAmerica;UnitedStates;NewYork;
feature2    NorthAmerica;Canada;;
feature3    Europe;Germany;;

Thanks for any advice you can spare. Cheers!

3 answers:

Answer 0: (score: 1)

Here is one option:

library(tidyverse)
data %>%
    gather(key, val, -query_name, -match_name) %>%
    select(-match_name, -key) %>%
    group_by(query_name, val) %>%
    add_count() %>%
    group_by(query_name) %>%
    filter(n == max(n)) %>%
    summarise(location_output = paste0(unique(val[!is.na(val)]), collapse = ";"))
## A tibble: 3 x 2
#  query_name location_output
#  <fct>      <chr>
#1 feature1   NorthAmerica;UnitedStates;NewYork
#2 feature2   NorthAmerica;Canada
#3 feature3   Europe;Germany
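
A note on the approach above: filter(n == max(n)) keeps the most frequent values in each group, and because key is dropped, the same string at two different levels (for example "Berlin" as a region and as a town) is counted together. That gives the right answer on the sample data, but it may not hold in general. The following is only a sketch of one possible variant, assuming dplyr >= 1.0 and tidyr >= 1.0; the names level_cols and result are made up for illustration. It checks each level separately and keeps a value only if every match has the same non-missing value.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
data <- data.frame(
  query_name = c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4)),
  match_name = paste0("match", 1:9),
  continent  = c(rep("NorthAmerica", 5), rep("Europe", 4)),
  country    = c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4)),
  region     = c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2)),
  town       = c("Manhattan", "Albany", "Buffalo", "Toronto", NA,
                 "Munich", "Nuremberg", "Berlin", "Frankfurt"),
  stringsAsFactors = FALSE
)

# Hypothetical name: location levels, ordered from coarsest to finest
level_cols <- c("continent", "country", "region", "town")

result <- data %>%
  pivot_longer(all_of(level_cols), names_to = "level", values_to = "val") %>%
  group_by(query_name, level) %>%
  # Keep a level only if all matches agree on one non-missing value
  summarise(val = if (!anyNA(val) && n_distinct(val) == 1) val[1] else NA_character_,
            .groups = "drop") %>%
  filter(!is.na(val)) %>%
  # Restore the coarsest-to-finest order before pasting
  arrange(query_name, match(level, level_cols)) %>%
  group_by(query_name) %>%
  summarise(location_output = paste(val, collapse = ";"))

print(result)
```

A query_name with no shared level at all would drop out of this result entirely, which may or may not be what you want.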

Answer 1: (score: 0)

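This is less elegant than @MauritsEvers's solution (it does not automatically handle an arbitrary number of levels), but it guarantees that every location_output has four ;-separated fields.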

library(dplyr)
data %>%
  group_by(query_name) %>%
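  # as.character() keeps ifelse() from returning integer codes when the columns are factors
  summarize(continent = ifelse(n_distinct(continent) == 1, as.character(first(continent)), ""),
            country = ifelse(n_distinct(country) == 1, as.character(first(country)), ""),
            region = ifelse(n_distinct(region) == 1, as.character(first(region)), ""),
            town = ifelse(n_distinct(town) == 1, as.character(first(town)), "")) %>%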
  mutate(location_output = paste(continent, country, region, town, sep = ";")) %>%
  select(query_name, location_output)
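
If the number of levels is not fixed, the same per-column check could be written once and applied to every level column. This is only a sketch, assuming dplyr >= 1.0 for across(); shared_or_blank, level_cols, and result are hypothetical names, not part of the answer above.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  query_name = c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4)),
  match_name = paste0("match", 1:9),
  continent  = c(rep("NorthAmerica", 5), rep("Europe", 4)),
  country    = c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4)),
  region     = c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2)),
  town       = c("Manhattan", "Albany", "Buffalo", "Toronto", NA,
                 "Munich", "Nuremberg", "Berlin", "Frankfurt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the common value if all entries agree and none is NA, else ""
shared_or_blank <- function(x) {
  x <- as.character(x)
  if (!anyNA(x) && length(unique(x)) == 1) x[1] else ""
}

# Hypothetical name: level columns, ordered from coarsest to finest
level_cols <- c("continent", "country", "region", "town")

result <- data %>%
  group_by(query_name) %>%
  summarise(across(all_of(level_cols), shared_or_blank), .groups = "drop")

# Paste the level columns row-wise, so every row gets one field per level
result$location_output <- do.call(paste, c(as.list(result[level_cols]), sep = ";"))
result <- result[c("query_name", "location_output")]

print(result)
```

Like the answer above, this treats each level independently, so a level can be reported as shared even when a coarser level is not.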

Answer 2: (score: 0)

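# For each query_name, keep the leading columns whose values are identical across matches
lapply(split(data, data$query_name), function(x){
    x = x[,-(1:2)]                                     # drop query_name and match_name
    r = rle(sapply(x, function(d) length(unique(d))))  # runs of equal distinct-value counts
    x[1, seq(r$lengths[1])]                            # first run; assumes continent is shared
})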
#$feature1
#     continent      country  region
#1 NorthAmerica UnitedStates NewYork

#$feature2
#     continent country
#4 NorthAmerica  Canada

#$feature3
#  continent country
#6    Europe Germany
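
To get from this list to the ;-separated strings asked for in the question, one possibility is the base R sketch below. The helper name shared_prefix and the variables level_cols, res, and result are made up for illustration. It marks each level as shared or not, then uses cumprod() so that a level only counts when every coarser level is shared as well. This also covers the case where the first column already differs, which the rle() approach above does not appear to handle.

```r
# Sample data from the question
data <- data.frame(
  query_name = c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4)),
  match_name = paste0("match", 1:9),
  continent  = c(rep("NorthAmerica", 5), rep("Europe", 4)),
  country    = c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4)),
  region     = c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2)),
  town       = c("Manhattan", "Albany", "Buffalo", "Toronto", NA,
                 "Munich", "Nuremberg", "Berlin", "Frankfurt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: deepest shared prefix of the level columns for one group
shared_prefix <- function(x, level_cols) {
  # TRUE where every row holds the same non-missing value
  is_shared <- vapply(x[level_cols],
                      function(d) !anyNA(d) && length(unique(d)) == 1,
                      logical(1))
  # A level is kept only if all coarser levels are shared too
  keep <- cumprod(is_shared) == 1
  out <- vapply(x[level_cols], function(d) as.character(d[1]), character(1))
  out[!keep] <- ""   # blank out everything below the first disagreement
  paste(out, collapse = ";")
}

# Hypothetical name: level columns, ordered from coarsest to finest
level_cols <- c("continent", "country", "region", "town")

res <- vapply(split(data, data$query_name), shared_prefix, character(1),
              level_cols = level_cols)
result <- data.frame(query_name = names(res), location_output = unname(res),
                     stringsAsFactors = FALSE)

print(result)
```

Because the non-shared levels are blanked rather than dropped, every string has one field per level, matching the trailing ; in the desired output.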