Hierarchical data aggregation and manipulation

时间:2019-01-07 13:01:46

标签: r aggregate aggregate-functions hierarchical-data data-manipulation

I am a novice to r and I am trying to do deal with some inconsistencies in my data. My problem is twofold, the first part could be of general interest and it is about how to aggregate data that is classified in a vector with multiple levels of aggregation. The second problem is more closely related to my coding issues and it is about performing some specific operations for my data.

I am looking at exports data of hundreds of countries over a period of two decades. The issue is that my data on exports are classified by product and sub-product categories (hundreds), in an inconsistent manner and I am trying to deal with these discrepancies.

The data looks roughly like this:

df <- data.frame(
"Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA","EU", "EU","EU", "EU", "EU", "EU", "EU", "EU"),
"Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU","USA", "USA", "USA","USA","USA", "USA","USA", "USA"), 
"Product cat." = c("1", "1.1", "1.2","2", "2.1", "2.2","3","1", "1.1","2", "2.1", "2.2","3","3.1", "3.2"), 
"Year" = c(1970, 1970, 1970, 1970, 1970, 1970,1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970), 
"Val" = c(200, 170, 30, 100, 50, 40,  220, 230, 180, 80, 50, 20, 170, 40, 130), 
stringsAsFactors = FALSE)

Product category 1.1 (eg. apple) and 1.2 (e.g. bananas) are sub-product categories of product category 1 (e.g. fruit). Product category 2.1 and 2.2 are subcategories of product cat. 2 and so on.

My ultimate goals are the following: First, the "value" of sub-product categories should always equate to the higher product category value. It is the case of USA export to the EU, product cat 1.1 (val=170) and 1.2 (val=30) aggregate to the level of product cat 1 (val=200). However, this is often not the case. For instance, in the case of USA export to the EU, the value of product cat 2.1 (val=50) and 2.2 (val=40) is smaller than product cat 2 (val=100). To deal with this issue I need to create a new sub-product cat. Ideally, this would (automatically) combine the beginning of the name of the product cat with a K (hence 2.K). This should be given a value of the difference between product cat 2 and its sub-product cat 2.1 and 2.2 (2.K= 100-(50+40) = 10). Also, I would like to apply the same approach to cases where I lack data on one of the subproduct categories. An example is in the exports from EU to USA where there are only values for product cat 1 and sub-product cat 1.1 and no information on cat 1.2. Ideally, I would create a new product cat (1.K) with the value of the difference between product cat 1 (val=230) and its sub-product cat 1.1(val = 180). Hence, the value of 1.k would be 230-180 = 50.

The second problem is that in some cases I do not have data on the sub-product categories, but I only have data at the aggregate level. As in the case of USA export to EU product cat 3 (that has no sub-categories). I would like to create a new sub-product cat a new that combines the beginning of the product cat with an M (hence 2.M) and incorporates the value at the product category level that is not reported in the subcategory level. Hence, for instance in the case of USA export to EU product cat 3 (220), 3.M = 220.

As mentioned, I think that there are two steps to deal with my coding issues. The first is on how to aggregate data that is hierarchical (to note that in my actual data I have three, not two, sub-product level (e.g. 1 food, 1.1fruit, 1.1.1 apples). Ideally, I would prefer avoiding creating new columns as my dataset involves hundreds of product categories. The second part is about performing the specific operations described above: 1) creating a new category with the difference between the father and child nodes, 2) creating fictitious child nodes. I would be really thankful to anyone that could help me with this as is key for the development of my paper.

I do realize it is a complex question, but also partial answers are very welcomed.

I thank you all in advance for your help

==============

Thank you a LAP lot for your help, Here is the problem I face with the real data after applying the function

split2 <- lapply(split1, function(x){
y <- rbind.data.frame(x, x[1,])
y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat2"], "k")
y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
return(y)
})

and the funtion split3 <- do.call(rbind, split2)

and here are the dput of the head of the two splits

>dput(Headsplit2)
list(`Algeria.United Arab Emirates.05` = structure(list(Reporter = 
c("Algeria", 
"Algeria", "Algeria", "Algeria"), Partner = c("United Arab Emirates", 
"United Arab Emirates", "United Arab Emirates", "United Arab 
Emirates"
), Year = c(2001L, 2001L, 2001L, 2001L), Product.cat. = c("05", 
"052", "054", "05k"), `Commodity Description` = c("Fruit and 
vegetables", 
"Dried fruit including artificially dehydrated", "Vegetables, roots & 
tubers, fresh or dried", 
"Fruit and vegetables"), `Trade Value` = 
structure(c(7.61814641291993e-319, 
7.4539189922423e-319, 1.64178014113046e-320, 7.61814641291993e-319
), class = "integer64"), Prodcat1 = c("0", "0", "0", "0"), Prodcat2 = 
c("05", 
"05", "05", "05")), row.names = c(NA, -4L), vars = c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE, indices = list(0:2), 
group_sizes = 3L, biggest_group_size = 3L, labels = structure(list(
Reporter = "Algeria", Partner = "United Arab Emirates", Prodcat2 = 
"05", 
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars = 
c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")), `Algeria.United Kingdom.05` = 
structure(list(
Reporter = c("Algeria", "Algeria", "Algeria", "Algeria"), 
Partner = c("United Kingdom", "United Kingdom", "United Kingdom", 
"United Kingdom"), Year = c(2001L, 2001L, 2001L, 2001L), 
Product.cat. = c("05", "053", "054", "05k"), `Commodity Description` 
= c("Fruit and vegetables", 
"Fruit,preserved and fruit preparations", "Vegetables, roots & 
tubers, fresh or dried", 
"Fruit and vegetables"), `Trade Value` = 
structure(c(6.99399328252869e-320, 
3.16547859290487e-320, 3.82802062397798e-320, 6.99399328252869e-320
), class = "integer64"), Prodcat1 = c("0", "0", "0", "0"), 
Prodcat2 = c("05", "05", "05", "05")), row.names = c(NA, 
-4L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop = 
TRUE, indices = list(
0:2), group_sizes = 3L, biggest_group_size = 3L, labels = 
structure(list(
Reporter = "Algeria", Partner = "United Kingdom", Prodcat2 = "05", 
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars = 
c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")), Hungary.Austria.26 = structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary", 
"Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), Partner = 
c("Austria", 
"Austria", "Austria", "Austria", "Austria", "Austria", "Austria", 
"Austria", "Austria"), Year = c(2000L, 2001L, 2000L, 2000L, 
2001L, 2000L, 2000L, 2001L, 2000L), Product.cat. = c("26", 
"26", "263", "265", "265", "266", "267", "267", "26k"), `Commodity 
Description` = c("Textile fibres, not manufactured, and waste", 
"Textile fibres, not manufactured, and waste", "Cotton", 
"Vegetable fibres,except cotton and jute", "Vegetable fibres,except 
cotton and jute", 
"Synthetic and regenerated artificial fibres", "Waste materials from 
textile fabrics, incl.rags", 
"Waste materials from textile fabrics, incl.rags", "Textile fibres, 
not manufactured, and waste"
), `Trade Value` = structure(c(7.3714594359514e-318, 
9.95542276370112e-318, 
4.94065645841247e-320, 2.96439387504748e-320, 6.91691904177745e-320, 
2.32210853545386e-319, 6.33886223614319e-318, 9.60957681161225e-318, 
7.3714594359514e-318), class = "integer64"), Prodcat1 = c("2", 
"2", "2", "2", "2", "2", "2", "2", "2"), Prodcat2 = c("26", 
"26", "26", "26", "26", "26", "26", "26", "26")), row.names = c(NA, 
-9L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop = 
TRUE, indices = list(
c(0L, 2L, 3L, 5L, 6L), c(1L, 4L, 7L)), group_sizes = c(5L, 
3L), biggest_group_size = 5L, labels = structure(list(Reporter = 
c("Hungary", 
"Hungary"), Partner = c("Austria", "Austria"), Prodcat2 = c("26", 
"26"), Year = 2000:2001), row.names = c(NA, -2L), class = 
"data.frame", vars = c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")), Hungary.Belgium.26 = structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary", 
"Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), Partner = 
c("Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
"Belgium", "Belgium"), Year = c(2000L, 2001L, 2000L, 2001L, 
2000L, 2001L, 2000L, 2001L, 2000L), Product.cat. = c("26", 
"26", "265", "265", "266", "266", "267", "267", "26k"), `Commodity 
Description` = c("Textile fibres, not manufactured, and waste", 
"Textile fibres, not manufactured, and waste", "Vegetable 
fibres,except cotton and jute", 
"Vegetable fibres,except cotton and jute", "Synthetic and regenerated 
artificial fibres", 
"Synthetic and regenerated artificial fibres", "Waste materials from 
textile fabrics, incl.rags", 
"Waste materials from textile fabrics, incl.rags", "Textile fibres, 
 not manufactured, and waste"
 ), `Trade Value` = structure(c(3.41893426922143e-318, 
7.98410083679454e-318, 
3.95252516672997e-320, 9.73309322307256e-319, 1.67488253940183e-318, 
1.665001226485e-318, 8.49792910846944e-319, 7.70742407512345e-319, 
3.41893426922143e-318), class = "integer64"), Prodcat1 = c("2", 
"2", "2", "2", "2", "2", "2", "2", "2"), Prodcat2 = c("26", 
"26", "26", "26", "26", "26", "26", "26", "26")), row.names = c(NA, 
-9L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop = 
TRUE, indices = list(
c(0L, 2L, 4L, 6L), c(1L, 3L, 5L, 7L)), group_sizes = c(4L, 
4L), biggest_group_size = 4L, labels = structure(list(Reporter = 
c("Hungary", 
"Hungary"), Partner = c("Belgium", "Belgium"), Prodcat2 = c("26", 
"26"), Year = 2000:2001), row.names = c(NA, -2L), class = 
 "data.frame", vars = c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")), Hungary.Bulgaria.26 = 
structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary", 
"Hungary", "Hungary"), Partner = c("Bulgaria", "Bulgaria", 
"Bulgaria", "Bulgaria", "Bulgaria", "Bulgaria"), Year = c(2000L, 
2001L, 2000L, 2001L, 2000L, 2000L), Product.cat. = c("26", 
"26", "266", "266", "267", "26k"), `Commodity Description` = 
c("Textile fibres, not manufactured, and waste", 
"Textile fibres, not manufactured, and waste", "Synthetic and 
regenerated artificial fibres", 
"Synthetic and regenerated artificial fibres", "Waste materials from 
textile fabrics, incl.rags", 
"Textile fibres, not manufactured, and waste"), `Trade Value` = 
structure(c(1.55136612794151e-318, 
1.53160350210786e-319, 4.94065645841247e-321, 4.94065645841247e-321, 
2.96439387504748e-320, 1.55136612794151e-318), class = "integer64"), 
Prodcat1 = c("2", "2", "2", "2", "2", "2"), Prodcat2 = c("26", 
"26", "26", "26", "26", "26")), row.names = c(NA, -6L), vars = 
c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE, indices = list(c(0L, 
 2L, 4L), c(1L, 3L)), group_sizes = 3:2, biggest_group_size = 3L, 
labels = structure(list(
Reporter = c("Hungary", "Hungary"), Partner = c("Bulgaria", 
"Bulgaria"), Prodcat2 = c("26", "26"), Year = 2000:2001), row.names = 
c(NA, 
-2L), class = "data.frame", vars = c("Reporter", "Partner", 
"Prodcat2", 
"Year"), drop = TRUE), class = c("grouped_df", "tbl_df", "tbl", 
"data.frame")), Hungary.Canada.26 = structure(list(Reporter = 
 c("Hungary", 
 "Hungary", "Hungary"), Partner = c("Canada", "Canada", "Canada"
 ), Year = c(2001L, 2001L, 2001L), Product.cat. = c("26", "265", 
 "26k"), `Commodity Description` = c("Textile fibres, not 
 manufactured, and waste", 
 "Vegetable fibres,except cotton and jute", "Textile fibres, not 
 manufactured, and waste"
 ), `Trade Value` = structure(c(8.89318162514244e-320, 
 6.4228533959362e-320, 
 8.89318162514244e-320), class = "integer64"), Prodcat1 = c("2", 
 "2", "2"), Prodcat2 = c("26", "26", "26")), row.names = c(NA, 
 -3L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop = 
 TRUE, indices = list(
 0:1), group_sizes = 2L, biggest_group_size = 2L, labels = 
structure(list(
Reporter = "Hungary", Partner = "Canada", Prodcat2 = "26", 
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars = 
c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")))

And of split 3

dput(Headsplit3)

structure(list(Reporter = c("Algeria", "Algeria", "Algeria", 
"Algeria", "Algeria", "Algeria"), Partner = c("United Arab Emirates", 
"United Arab Emirates", "United Arab Emirates", "United Arab 
Emirates", 
"United Kingdom", "United Kingdom"), Year = c(2001L, 2001L, 2001L, 
2001L, 2001L, 2001L), Product.cat. = c("05", "052", "054", "05k", 
"05", "053"), `Commodity Description` = c("Fruit and vegetables", 
"Dried fruit including artificially dehydrated", "Vegetables, roots & 
tubers, fresh or dried", 
"Fruit and vegetables", "Fruit and vegetables", "Fruit,preserved and 
fruit preparations"
), `Trade Value` = structure(c(7.61814641291993e-319, 
7.4539189922423e-319, 
1.64178014113046e-320, 7.61814641291993e-319, 6.99399328252869e-320, 
3.16547859290487e-320), class = "integer64"), Prodcat1 = c("0", 
"0", "0", "0", "0", "0"), Prodcat2 = c("05", "05", "05", "05", 
"05", "05")), row.names = c(NA, -6L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = c("Reporter", "Partner", 
"Prodcat2", "Year"), drop = TRUE, indices = list(0:3, 4:5), 
group_sizes = c(4L, 
2L), biggest_group_size = 4L, labels = structure(list(Reporter = 
c("Algeria", 
"Algeria"), Partner = c("United Arab Emirates", "United Kingdom"
), Prodcat2 = c("05", "05"), Year = c(2001L, 2001L)), row.names = 
c(NA, 
-2L), class = "data.frame", vars = c("Reporter", "Partner", 
"Prodcat2", 
"Year"), drop = TRUE))

As you can see the code is able to identify that Algeria exports of 052 and 054 to the United Arab Emirates do not add up the exports of 05 - (the difference is only 1) and it does correctly creates a variable of 05k, yet the traded value of 05k is 154193 (= to the traded value of the whole 05) rather than being 1. Do you know why this could be the case?

1 个答案:

答案 0 :(得分:1)

编辑:好的,我想我知道了!


数据:

df <- data.frame( "Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA", "USA", "USA","USA"), 
                  "Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU", "EU","EU","EU"), 
                  "Product cat." = c("1", "11","111", "12","2", "21", "211", "212", "22", "3"), 
                  "Val" = c(200, 170, 170, 30, 100, 50, 25, 5, 40, 220), stringsAsFactors = FALSE)

我们首先创建两个辅助变量Prodcat1Prodcat2

# create new variable Prodcat1 
df1 <- df %>% group_by(Reporter, Partner) %>% mutate(Prodcat1 = str_extract(Product.cat., "^.{1}")) 

# create new variable Prodcat2 for my 2nd level product category 
df1 <- df1 %>% group_by(Reporter, Partner) %>% mutate(Prodcat2 = str_extract(Product.cat., "^.{2}"))

现在,我们将数据分为两部分,一个要完成,另一个不需要在第三层进行任何操作:

# to be completed
df2 <- df1 %>%
  group_by(Reporter, Partner, Prodcat2) %>%
  filter(sum(Val[2:n()]) < Val[1])

# no operation on third level
df3 <- df1 %>%
  group_by(Reporter, Partner, Prodcat2) %>%
  filter(!sum(Val[2:n()]) < Val[1] | n() == 1)

我们将df2除以Prodcat2,控制ReporterPartner

split1 <- split(df2, interaction(df2$Reporter, df2$Partner, df2$Prodcat2))
split1 <- split1[sapply(split1, nrow) != 0]

并在必要时添加新行:

split2 <- lapply(split1, function(x){
  y <- rbind.data.frame(x, x[1,])
  y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat2"], "k")
  y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
  return(y)
})

然后,我们第一次将数据重新整理在一起,并按原始的Product.cat.对其进行排序

split3 <- do.call(rbind, split2)
newdf <- do.call(rbind, list(split3, df3))

newdf <- newdf %>%
  arrange(Product.cat.)

到目前为止的数据:

# A tibble: 11 x 6
# Groups:   Reporter, Partner, Prodcat2 [5]
   Reporter Partner Product.cat.   Val Prodcat1 Prodcat2
   <chr>    <chr>   <chr>        <dbl> <chr>    <chr>   
 1 USA      EU      1              200 1        NA      
 2 USA      EU      11             170 1        11      
 3 USA      EU      111            170 1        11      
 4 USA      EU      12              30 1        12      
 5 USA      EU      2              100 2        NA      
 6 USA      EU      21              50 2        21      
 7 USA      EU      211             25 2        21      
 8 USA      EU      212              5 2        21      
 9 USA      EU      21k             20 2        21      
10 USA      EU      22              40 2        22      
11 USA      EU      3              220 3        NA  

现在,我们进入第二级。首先,我们创建三个部分:

# part to complete
df4 <- newdf %>%
  group_by(Reporter, Partner, Prodcat1) %>%
  filter(nchar(Product.cat.) < 3) %>%
  filter(n() == 1 | sum(Val[2:n()]) < Val[1])

# third level rows, which are not necessary here
df5 <- newdf %>%
  group_by(Reporter, Partner, Prodcat1) %>%
  filter(nchar(Product.cat.) == 3)

# second level part already complete
df6 <- newdf %>%
  group_by(Reporter, Partner, Prodcat1) %>%
  filter(nchar(Product.cat.) < 3) %>%
  filter(sum(Val[2:n()]) == Val[1])

我们再次通过Prodcat1拆分数据,控制ReporterPartner

split3 <- split(df4, interaction(df4$Reporter, df4$Partner, df4$Prodcat1))
split3 <- split3[sapply(split3, nrow) != 0]

我们创建新行:

split4 <- lapply(split3, function(x){
  if(nrow(x) == 1){
    y <- rbind.data.frame(x, x)
    y[2, "Product.cat."] <- paste0(y[2, "Prodcat1"], "m")
  }else{
    y <- rbind.data.frame(x, x[1,])
    y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat1"], "k")
    y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
  }
  return(y)
})

然后将它们重新粘在一起,再次排序并除去辅助变量。

split5 <- do.call(rbind, split4)
finaldf <- do.call(rbind, list(split5, df5, df6))

finaldf <- finaldf %>%
  ungroup() %>%
  arrange(Product.cat.) %>%
  select(-c("Prodcat1", "Prodcat2"))

最终数据:

# A tibble: 13 x 4
   Reporter Partner Product.cat.   Val
   <chr>    <chr>   <chr>        <dbl>
 1 USA      EU      1              200
 2 USA      EU      11             170
 3 USA      EU      111            170
 4 USA      EU      12              30
 5 USA      EU      2              100
 6 USA      EU      21              50
 7 USA      EU      211             25
 8 USA      EU      212              5
 9 USA      EU      21k             20
10 USA      EU      22              40
11 USA      EU      2k              10
12 USA      EU      3              220
13 USA      EU      3m             220

最后,我们清除了所有需要的临时对象的环境

rm(df1, df2, df3, df4, df5, df6, newdf, split1, split2, split3, split4, split5)

剩下原始数据集df和最终的完整数据集finaldata:)