Aggregate data in dataframe by first transforming values in column

Time: 2018-12-03 13:06:20

Tags: r dplyr

I have a data set with import and export numbers from countries which looks basically like this:

Country_from   Country_to    Count    Value
   UK             USA          5        10
   France         Belgium      4        7
   USA            UK           1        6
   Belgium        France       8        9

Now I want to aggregate this data by summing the import and export numbers for each pair of countries. So I want my resulting dataframe to be:

 Country_from   Country_to    Count    Value
   UK             USA          6        16
   France         Belgium      12       16

I wrote a script that concatenates the from and to countries and then sorts the characters alphabetically, so that, for example, UK-USA and USA-UK produce the same key. It then aggregates the values by that key.

This sorting part of my code looks like the following:

#concatenate to and from country name
country_from = data.frame(lapply(data_lines$Country_from, as.character), stringsAsFactors=FALSE)
country_to = data.frame(lapply(data_lines$Country_to, as.character), stringsAsFactors=FALSE)
concat_names = as.matrix(paste(country_from, country_to, " "))

#order characters alphabetically
strSort <- function(x)
 sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")

sorted = strSort(concat_names)

This approach works in this specific case, but in theory two different country pairs could produce the same string of alphabetically sorted characters.
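To make the concern concrete, here is a minimal sketch of such a collision. The names "AB", "CD", "AC" and "BD" are hypothetical placeholders, not real countries; they are chosen only so that two different pairs contain the same characters.

```r
# Character-level sorting, as in the strSort() helper above
strSort <- function(x)
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse = "")

# Hypothetical names, picked only to force a collision
pair1 <- paste("AB", "CD")  # "AB CD"
pair2 <- paste("AC", "BD")  # "AC BD"

# Both pairs contain the same characters, so their sorted strings match
strSort(pair1) == strSort(pair2)  # TRUE, although the pairs differ
```

Any key built from sorted characters loses the boundary between the two names, which is why the pairs become indistinguishable.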

If a Country_from-Country_to combination has no matching reverse combination, its values should be kept as given (so nothing needs to be done).

Does anyone have an idea how to do this without using the alphabetically sorted characters?

1 answer:

Answer 0 (score: 1)

One way using dplyr is to create a grouping variable by sorting and pasting Country_from and Country_to rowwise, and then take the sum by that group.

library(dplyr)

df %>%
  rowwise() %>%
  # build an order-independent key by sorting the two country names
  mutate(country = paste(sort(c(Country_from, Country_to)), collapse = "-")) %>%
  ungroup() %>%
  group_by(country) %>%
  # sum Count and Value within each pair
  summarise_at(vars(Count:Value), sum)

#   country        Count Value
#  <chr>          <int> <int>
#1 Belgium-France    12    16
#2 UK-USA             6    16

Here, instead of sorting the characters, we sort the words.
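The same word-level idea can also be sketched in base R without rowwise(). This is an alternative to the dplyr code above, not part of the original answer; it assumes the sample data from the question and uses pmin() and pmax() to put the alphabetically smaller name first.

```r
# Sample data from the question
df <- data.frame(
  Country_from = c("UK", "France", "USA", "Belgium"),
  Country_to   = c("USA", "Belgium", "UK", "France"),
  Count        = c(5, 4, 1, 8),
  Value        = c(10, 7, 6, 9),
  stringsAsFactors = FALSE
)

# Order-independent key: smaller name first, larger name second.
# as.character() guards against the columns being factors.
from <- as.character(df$Country_from)
to   <- as.character(df$Country_to)
df$country <- paste(pmin(from, to), pmax(from, to), sep = "-")

# Sum Count and Value within each unordered pair
result <- aggregate(cbind(Count, Value) ~ country, data = df, FUN = sum)
result
```

A pair that has no reverse row forms a group of its own, so its values pass through unchanged. Because whole names are compared rather than single characters, two different pairs cannot share a key unless a country name itself contains the "-" separator.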