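I have been given a set of country groups and I am trying to obtain a set of mutually exclusive regions so that I can compare them. The problem is that my data contains several groups, many of which overlap. How can I get a set of groups that covers all countries but in which no two groups overlap?
For example, suppose this is the list of countries in the world: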
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")
And suppose these are my groups:
df <- data.frame(group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania", "Oceania", "Commonwealth Countries"),
element = c("Angola", "France", "Germany", "France", "Australia", "New Zealand", "Australia"))
                   group     element
1                 Africa      Angola
2         Western Europe      France
3                 Europe     Germany
4                 Europe      France
5                Oceania   Australia
6                Oceania New Zealand
7 Commonwealth Countries   Australia
How can I remove the overlapping groups (in this case Western Europe) to obtain a set of groups that contains every country, like this:
df_solved <- data.frame(group = c("Africa", "Europe", "Europe", "Oceania", "Oceania"),
element = c("Angola", "France", "Germany", "Australia", "New Zealand"))
    group     element
1  Africa      Angola
2  Europe      France
3  Europe     Germany
4 Oceania   Australia
5 Oceania New Zealand
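Answer 0 (score: 3)
One possible rule is to minimize the number of groups, e.g. by assigning each element to the group that contains the most elements.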
library(data.table)
setDT(df)[, n.elements := .N, by = group][
order(-n.elements), .(group = group[1L]), by = element]
       element   group
1:     Germany  Europe
2:      France  Europe
3:   Australia Oceania
4: New Zealand Oceania
5:      Angola  Africa
setDT(df)[, n.elements := .N, by = group][]
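returns
                    group     element n.elements
1:                 Africa      Angola          1
2:         Western Europe      France          1
3:                 Europe     Germany          2
4:                 Europe      France          2
5:                Oceania   Australia          2
6:                Oceania New Zealand          2
7: Commonwealth Countries   Australia          1
Now the rows are ordered by decreasing number of elements and, for each country, the first group, i.e. the "largest" one, is picked. This should return one group per country, as requested. In case of ties, i.e. when several groups contain the same number of elements, you can add extra criteria when ordering, e.g. the length of the group name, or simply alphabetical order.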
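The tie-breaking idea can be sketched without any packages. This is a minimal base R illustration rather than the data.table code above; the "Eurozone" group is a hypothetical addition, made up only so that two groups of equal size compete for France and Germany, and ties are broken alphabetically by group name.

```r
# Example data from the question plus a hypothetical "Eurozone" group
# (an assumption, added only to create a tie with "Europe")
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania",
            "Oceania", "Commonwealth Countries", "Eurozone", "Eurozone"),
  element = c("Angola", "France", "Germany", "France", "Australia",
              "New Zealand", "Australia", "Germany", "France"),
  stringsAsFactors = FALSE
)

# Number of elements in each group, repeated on every row of that group
df$n.elements <- as.integer(table(df$group)[df$group])

# Largest groups first; ties broken alphabetically by group name
sorted_df <- df[order(-df$n.elements, df$group), ]

# Keep the first (i.e. preferred) group for every element
result <- sorted_df[!duplicated(sorted_df$element), c("group", "element")]
result
```

Here "Europe" and "Eurozone" both have two elements, so the alphabetical rule assigns France and Germany to "Europe".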
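Answer 1 (score: 2)
1) If you simply want to eliminate duplicated elements, use !duplicated(...) as shown. No packages are used. Note that this keeps whichever group is listed first for each element.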
subset(df, !duplicated(element))
giving:
           group     element
1         Africa      Angola
2 Western Europe      France
3         Europe     Germany
5        Oceania   Australia
6        Oceania New Zealand
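2) Set partitioning: If each group must be either entirely in or entirely out, and each element may appear only once, then this is a set partitioning problem: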
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, "=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
    group     element
1  Africa      Angola
3  Europe     Germany
4  Europe      France
5 Oceania   Australia
6 Oceania New Zealand
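Whether a chosen set of groups really is a partition can be checked directly: every country must occur in exactly one of the chosen groups. The helper below is a hypothetical sketch in base R (the name is_partition does not come from any package), and it rebuilds the example data so that it runs on its own.

```r
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania",
            "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France", "Australia",
              "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)

# TRUE if every country in `world` occurs in exactly one of the chosen groups
# and the chosen groups contain nothing outside `world`
is_partition <- function(chosen, df, world) {
  covered <- df$element[df$group %in% chosen]
  counts <- table(factor(covered, levels = world))
  all(counts == 1L) && all(covered %in% world)
}

ok        <- is_partition(c("Africa", "Europe", "Oceania"), df, World)
overlap   <- is_partition(c("Africa", "Europe", "Western Europe", "Oceania"), df, World)
uncovered <- is_partition(c("Africa", "Oceania"), df, World)
c(ok = ok, overlap = overlap, uncovered = uncovered)
```

Only the first choice passes: the second contains France twice, and the third leaves France and Germany uncovered.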
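3) Set covering: Of course, there may be no exact set partition, so we could instead consider the set covering problem (the same code, except that "=" is replaced by ">=" in the lp line):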
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, ">=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
    group     element
1  Africa      Angola
3  Europe     Germany
4  Europe      France
5 Oceania   Australia
6 Oceania New Zealand
We can then optionally apply (1) to remove any duplicates in the cover.
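If lpSolve is not available, a set cover can also be approximated with the classic greedy heuristic: repeatedly pick the group that covers the most still-uncovered elements. This is a swapped-in technique, not the integer program above, and in general it is not guaranteed to find the smallest cover. The function name greedy_cover is made up for this sketch.

```r
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania",
            "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France", "Australia",
              "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)

greedy_cover <- function(df) {
  sets <- split(df$element, df$group)   # named list: group -> its elements
  uncovered <- unique(df$element)
  chosen <- character(0)
  while (length(uncovered) > 0) {
    # number of still-uncovered elements each group would add
    gain <- vapply(sets, function(s) sum(uncovered %in% s), integer(1))
    best <- names(gain)[which.max(gain)]
    chosen <- c(chosen, best)
    uncovered <- setdiff(uncovered, sets[[best]])
  }
  chosen
}

cover <- greedy_cover(df)
subset(df, group %in% cover)
```

On the example data the greedy choice ends up with Africa, Europe and Oceania; on other data it may pick more groups than an exact solver would.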
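4) Non-dominated groups: Another approach is to remove any group whose elements form a strict subset of the elements of some other group. For example, every element of Western Europe is in Europe, and Europe has more elements than Western Europe, so the elements of Western Europe are a strict subset of those of Europe and we remove Western Europe. Using const.mat from above: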
# returns TRUE if jth column of const.mat is dominated by some other column
is_dom_fun <- function(j) any(apply(const.mat[, j] <= const.mat[, -j], 2, all) &
sum(const.mat[, j]) < colSums(const.mat[, -j]))
is_dom <- sapply(seq_len(ncol(const.mat)), is_dom_fun)
subset(df, group %in% colnames(const.mat)[!is_dom])
giving:
    group     element
1  Africa      Angola
3  Europe     Germany
4  Europe      France
5 Oceania   Australia
6 Oceania New Zealand
If there are any duplicates, we can use (1) to remove them.
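The same dominance test can also be written with sets instead of the incidence matrix, which some readers may find easier to follow. This is a minimal base R sketch; is_strict_subset is a hypothetical helper name, and the example data is rebuilt so the block runs on its own.

```r
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania",
            "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France", "Australia",
              "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)

sets <- split(df$element, df$group)   # named list: group -> its elements

# TRUE if every element of `a` is in `b` and `b` has more distinct elements
is_strict_subset <- function(a, b) {
  all(a %in% b) && length(unique(a)) < length(unique(b))
}

# A group is dominated if its elements are a strict subset of another group's
is_dom <- vapply(names(sets), function(g) {
  others <- sets[names(sets) != g]
  any(vapply(others, function(o) is_strict_subset(sets[[g]], o), logical(1)))
}, logical(1))

subset(df, group %in% names(sets)[!is_dom])
```

On the example data, Western Europe (a strict subset of Europe) and Commonwealth Countries (a strict subset of Oceania) are flagged as dominated.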
Answer 2 (score: 1)
library(dplyr)
df %>% distinct(element, .keep_all=TRUE)
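           group     element
1         Africa      Angola
2 Western Europe      France
3         Europe     Germany
4        Oceania   Australia
5        Oceania New Zealand
Hat tip to Axeman, who beat me to this answer.
Update:
Your question is not well defined. Why is 'Europe' preferred over 'Western Europe'? In other words, each country is assigned to several groups, and you want to reduce that to one group per country. How do you decide which group?
Here is one way, in which we always prefer the largest group: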
groups <- df %>% count(group)
df %>% inner_join(groups, by='group') %>%
arrange(desc(n)) %>% distinct(element, .keep_all=TRUE)
    group     element n
1  Europe     Germany 2
2  Europe      France 2
3 Oceania   Australia 2
4 Oceania New Zealand 2
5  Africa      Angola 1
Answer 3 (score: 0)
Here is an option using data.table:
library(data.table)
setDT(df)[, head(.SD, 1), element]
or with unique:
unique(setDT(df), by = 'element')
#            group     element
#1:         Africa      Angola
#2: Western Europe      France
#3:         Europe     Germany
#4:        Oceania   Australia
#5:        Oceania New Zealand
The only package used is data.table.
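Answer 4 (score: 0)
A completely different approach is to ignore the given groups and simply look up the country names in the UN region catalogue, which is available in the countrycode and ISOcodes packages.
The countrycode package seems to offer the simpler interface, and it also warns about country names that cannot be found in its database: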
# given country names - note the deliberately misspelled last entry
World <- c("Angola", "France", "Germany", "Australia", "New Zealand", "New Sealand")
# regions
countrycode::countrycode(World, "country.name.en", "region")
[1] "Middle Africa"             "Western Europe"            "Western Europe"            "Australia and New Zealand"
[5] "Australia and New Zealand" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "region") :
  Some values were not matched unambiguously: New Sealand
# continents
countrycode::countrycode(World, "country.name.en", "continent")
[1] "Africa"  "Europe"  "Europe"  "Oceania" "Oceania" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "continent") :
  Some values were not matched unambiguously: New Sealand