I have a data set with import and export numbers from countries which looks basically like this:
Country_from Country_to Count Value
UK USA 5 10
France Belgium 4 7
USA UK 1 6
Belgium France 8 9
Now, I want to aggregate this data and to combine the import and export numbers by summation. So, I want my resulting dataframe to be:
Country_from Country_to Count Value
UK USA 6 16
France Belgium 12 16
I made a script which concates the to
and from
countries and then sorts the characters alphabetically to check whether, for example, UK - USA and USA-UK are the same and then aggregates the values.
This sorting part of my code looks like the following:
#concatenate to and from country name
country_from = data.frame(lapply(data_lines$Country_from, as.character), stringsAsFactors=FALSE)
country_to = data.frame(lapply(data_lines$Country_to, as.character), stringsAsFactors=FALSE)
concat_names = as.matrix(paste(country_from, country_to, " "))
#order characters alphabetically
strSort <- function(x)
sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")
sorted = strSort(concat_names)
This approach works in this specific case, but it could theoretically be the case that two different countries have the same alphabetically sorted characters.
If there is a Country_from
-Country_to
combination without the same reverse, then it should save the values as they are given (so do nothing).
Does anyone have an idea how to do this without using the alphabetically sorted characters?
答案 0 :(得分:1)
使用dplyr
的一种方法是通过对rowwise
和Country_from
进行排序和粘贴来创建Country_to
分组变量,然后按该分组使用sum
。
library(dplyr)
df %>%
rowwise() %>%
mutate(country = paste(sort(c(Country_from, Country_to)), collapse = "-")) %>%
ungroup() %>%
group_by(country) %>%
summarise_at(vars(Count:Value), funs(sum))
# country Count Value
# <chr> <int> <int>
#1 Belgium-France 12 16
#2 UK-USA 6 16
在这里,不是对字符进行排序,而是对单词进行排序。