I am new to R and am trying to deal with some inconsistencies in my data. My problem is twofold. The first part may be of general interest: how to aggregate data that is classified in a vector with multiple levels of aggregation. The second is more specific to my case: performing some particular operations on my data.
I am looking at export data for hundreds of countries over a period of two decades. The issue is that my export data are classified by product and sub-product categories (hundreds of them) in an inconsistent manner, and I am trying to deal with these discrepancies.
The data looks roughly like this:
df <- data.frame(
"Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA","EU", "EU","EU", "EU", "EU", "EU", "EU", "EU"),
"Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU","USA", "USA", "USA","USA","USA", "USA","USA", "USA"),
"Product cat." = c("1", "1.1", "1.2","2", "2.1", "2.2","3","1", "1.1","2", "2.1", "2.2","3","3.1", "3.2"),
"Year" = c(1970, 1970, 1970, 1970, 1970, 1970,1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970),
"Val" = c(200, 170, 30, 100, 50, 40, 220, 230, 180, 80, 50, 20, 170, 40, 130),
stringsAsFactors = FALSE)
Product categories 1.1 (e.g. apples) and 1.2 (e.g. bananas) are sub-product categories of product category 1 (e.g. fruit). Product categories 2.1 and 2.2 are sub-categories of product cat. 2, and so on.
My ultimate goals are the following. First, the values of the sub-product categories should always add up to the value of the higher-level product category. This is the case for USA exports to the EU in product cat. 1: sub-categories 1.1 (val = 170) and 1.2 (val = 30) add up to product cat. 1 (val = 200). However, this is often not the case. For instance, for USA exports to the EU, the sum of product cat. 2.1 (val = 50) and 2.2 (val = 40) is smaller than product cat. 2 (val = 100). To deal with this, I need to create a new sub-product category. Ideally, its name would (automatically) combine the beginning of the product category name with a K (hence 2.K), and its value would be the difference between product cat. 2 and its sub-product categories 2.1 and 2.2 (2.K = 100 - (50 + 40) = 10). I would also like to apply the same approach where I lack data on one of the sub-product categories. An example is exports from the EU to the USA, where there are only values for product cat. 1 and sub-product cat. 1.1 and no information on cat. 1.2. Ideally, I would create a new sub-product category (1.K) whose value is the difference between product cat. 1 (val = 230) and its sub-product cat. 1.1 (val = 180). Hence, the value of 1.K would be 230 - 180 = 50.
The second problem is that in some cases I have no data on the sub-product categories, only data at the aggregate level, as with USA exports to the EU in product cat. 3 (which has no sub-categories). I would like to create a new sub-product category that combines the beginning of the product category name with an M (hence 3.M) and takes the value at the product category level that is not reported at the sub-category level. For instance, for USA exports to the EU in product cat. 3 (val = 220), 3.M = 220.
As mentioned, I think there are two steps to my coding problem. The first is how to aggregate hierarchical data (note that my actual data has three levels rather than two, e.g. 1 food, 1.1 fruit, 1.1.1 apples). Ideally, I would prefer to avoid creating new columns, as my dataset involves hundreds of product categories. The second is performing the specific operations described above: 1) creating a new category holding the difference between the parent and child nodes, and 2) creating fictitious child nodes. I would be really thankful to anyone who could help me with this, as it is key to the development of my paper.
I do realize it is a complex question, but partial answers are also very welcome.
I thank you all in advance for your help.
==============
Thank you a lot for your help. Here is the problem I face with the real data after applying the function
split2 <- lapply(split1, function(x){
y <- rbind.data.frame(x, x[1,])
y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat2"], "k")
y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
return(y)
})
and the call split3 <- do.call(rbind, split2)
Here is the dput of the head of the two splits:
>dput(Headsplit2)
list(`Algeria.United Arab Emirates.05` = structure(list(Reporter =
c("Algeria",
"Algeria", "Algeria", "Algeria"), Partner = c("United Arab Emirates",
"United Arab Emirates", "United Arab Emirates", "United Arab
Emirates"
), Year = c(2001L, 2001L, 2001L, 2001L), Product.cat. = c("05",
"052", "054", "05k"), `Commodity Description` = c("Fruit and
vegetables",
"Dried fruit including artificially dehydrated", "Vegetables, roots &
tubers, fresh or dried",
"Fruit and vegetables"), `Trade Value` =
structure(c(7.61814641291993e-319,
7.4539189922423e-319, 1.64178014113046e-320, 7.61814641291993e-319
), class = "integer64"), Prodcat1 = c("0", "0", "0", "0"), Prodcat2 =
c("05",
"05", "05", "05")), row.names = c(NA, -4L), vars = c("Reporter",
"Partner", "Prodcat2", "Year"), drop = TRUE, indices = list(0:2),
group_sizes = 3L, biggest_group_size = 3L, labels = structure(list(
Reporter = "Algeria", Partner = "United Arab Emirates", Prodcat2 =
"05",
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars =
c("Reporter",
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame")), `Algeria.United Kingdom.05` =
structure(list(
Reporter = c("Algeria", "Algeria", "Algeria", "Algeria"),
Partner = c("United Kingdom", "United Kingdom", "United Kingdom",
"United Kingdom"), Year = c(2001L, 2001L, 2001L, 2001L),
Product.cat. = c("05", "053", "054", "05k"), `Commodity Description`
= c("Fruit and vegetables",
"Fruit,preserved and fruit preparations", "Vegetables, roots &
tubers, fresh or dried",
"Fruit and vegetables"), `Trade Value` =
structure(c(6.99399328252869e-320,
3.16547859290487e-320, 3.82802062397798e-320, 6.99399328252869e-320
), class = "integer64"), Prodcat1 = c("0", "0", "0", "0"),
Prodcat2 = c("05", "05", "05", "05")), row.names = c(NA,
-4L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop =
TRUE, indices = list(
0:2), group_sizes = 3L, biggest_group_size = 3L, labels =
structure(list(
Reporter = "Algeria", Partner = "United Kingdom", Prodcat2 = "05",
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars =
c("Reporter",
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame")), Hungary.Austria.26 = structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary",
"Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), Partner =
c("Austria",
"Austria", "Austria", "Austria", "Austria", "Austria", "Austria",
"Austria", "Austria"), Year = c(2000L, 2001L, 2000L, 2000L,
2001L, 2000L, 2000L, 2001L, 2000L), Product.cat. = c("26",
"26", "263", "265", "265", "266", "267", "267", "26k"), `Commodity
Description` = c("Textile fibres, not manufactured, and waste",
"Textile fibres, not manufactured, and waste", "Cotton",
"Vegetable fibres,except cotton and jute", "Vegetable fibres,except
cotton and jute",
"Synthetic and regenerated artificial fibres", "Waste materials from
textile fabrics, incl.rags",
"Waste materials from textile fabrics, incl.rags", "Textile fibres,
not manufactured, and waste"
), `Trade Value` = structure(c(7.3714594359514e-318,
9.95542276370112e-318,
4.94065645841247e-320, 2.96439387504748e-320, 6.91691904177745e-320,
2.32210853545386e-319, 6.33886223614319e-318, 9.60957681161225e-318,
7.3714594359514e-318), class = "integer64"), Prodcat1 = c("2",
"2", "2", "2", "2", "2", "2", "2", "2"), Prodcat2 = c("26",
"26", "26", "26", "26", "26", "26", "26", "26")), row.names = c(NA,
-9L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop =
TRUE, indices = list(
c(0L, 2L, 3L, 5L, 6L), c(1L, 4L, 7L)), group_sizes = c(5L,
3L), biggest_group_size = 5L, labels = structure(list(Reporter =
c("Hungary",
"Hungary"), Partner = c("Austria", "Austria"), Prodcat2 = c("26",
"26"), Year = 2000:2001), row.names = c(NA, -2L), class =
"data.frame", vars = c("Reporter",
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame")), Hungary.Belgium.26 = structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary",
"Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), Partner =
c("Belgium",
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium",
"Belgium", "Belgium"), Year = c(2000L, 2001L, 2000L, 2001L,
2000L, 2001L, 2000L, 2001L, 2000L), Product.cat. = c("26",
"26", "265", "265", "266", "266", "267", "267", "26k"), `Commodity
Description` = c("Textile fibres, not manufactured, and waste",
"Textile fibres, not manufactured, and waste", "Vegetable
fibres,except cotton and jute",
"Vegetable fibres,except cotton and jute", "Synthetic and regenerated
artificial fibres",
"Synthetic and regenerated artificial fibres", "Waste materials from
textile fabrics, incl.rags",
"Waste materials from textile fabrics, incl.rags", "Textile fibres,
not manufactured, and waste"
), `Trade Value` = structure(c(3.41893426922143e-318,
7.98410083679454e-318,
3.95252516672997e-320, 9.73309322307256e-319, 1.67488253940183e-318,
1.665001226485e-318, 8.49792910846944e-319, 7.70742407512345e-319,
3.41893426922143e-318), class = "integer64"), Prodcat1 = c("2",
"2", "2", "2", "2", "2", "2", "2", "2"), Prodcat2 = c("26",
"26", "26", "26", "26", "26", "26", "26", "26")), row.names = c(NA,
-9L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop =
TRUE, indices = list(
c(0L, 2L, 4L, 6L), c(1L, 3L, 5L, 7L)), group_sizes = c(4L,
4L), biggest_group_size = 4L, labels = structure(list(Reporter =
c("Hungary",
"Hungary"), Partner = c("Belgium", "Belgium"), Prodcat2 = c("26",
"26"), Year = 2000:2001), row.names = c(NA, -2L), class =
"data.frame", vars = c("Reporter",
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame")), Hungary.Bulgaria.26 =
structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary",
"Hungary", "Hungary"), Partner = c("Bulgaria", "Bulgaria",
"Bulgaria", "Bulgaria", "Bulgaria", "Bulgaria"), Year = c(2000L,
2001L, 2000L, 2001L, 2000L, 2000L), Product.cat. = c("26",
"26", "266", "266", "267", "26k"), `Commodity Description` =
c("Textile fibres, not manufactured, and waste",
"Textile fibres, not manufactured, and waste", "Synthetic and
regenerated artificial fibres",
"Synthetic and regenerated artificial fibres", "Waste materials from
textile fabrics, incl.rags",
"Textile fibres, not manufactured, and waste"), `Trade Value` =
structure(c(1.55136612794151e-318,
1.53160350210786e-319, 4.94065645841247e-321, 4.94065645841247e-321,
2.96439387504748e-320, 1.55136612794151e-318), class = "integer64"),
Prodcat1 = c("2", "2", "2", "2", "2", "2"), Prodcat2 = c("26",
"26", "26", "26", "26", "26")), row.names = c(NA, -6L), vars =
c("Reporter",
"Partner", "Prodcat2", "Year"), drop = TRUE, indices = list(c(0L,
2L, 4L), c(1L, 3L)), group_sizes = 3:2, biggest_group_size = 3L,
labels = structure(list(
Reporter = c("Hungary", "Hungary"), Partner = c("Bulgaria",
"Bulgaria"), Prodcat2 = c("26", "26"), Year = 2000:2001), row.names =
c(NA,
-2L), class = "data.frame", vars = c("Reporter", "Partner",
"Prodcat2",
"Year"), drop = TRUE), class = c("grouped_df", "tbl_df", "tbl",
"data.frame")), Hungary.Canada.26 = structure(list(Reporter =
c("Hungary",
"Hungary", "Hungary"), Partner = c("Canada", "Canada", "Canada"
), Year = c(2001L, 2001L, 2001L), Product.cat. = c("26", "265",
"26k"), `Commodity Description` = c("Textile fibres, not
manufactured, and waste",
"Vegetable fibres,except cotton and jute", "Textile fibres, not
manufactured, and waste"
), `Trade Value` = structure(c(8.89318162514244e-320,
6.4228533959362e-320,
8.89318162514244e-320), class = "integer64"), Prodcat1 = c("2",
"2", "2"), Prodcat2 = c("26", "26", "26")), row.names = c(NA,
-3L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop =
TRUE, indices = list(
0:1), group_sizes = 2L, biggest_group_size = 2L, labels =
structure(list(
Reporter = "Hungary", Partner = "Canada", Prodcat2 = "26",
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars =
c("Reporter",
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame")))
And of split3:
dput(Headsplit3)
structure(list(Reporter = c("Algeria", "Algeria", "Algeria",
"Algeria", "Algeria", "Algeria"), Partner = c("United Arab Emirates",
"United Arab Emirates", "United Arab Emirates", "United Arab
Emirates",
"United Kingdom", "United Kingdom"), Year = c(2001L, 2001L, 2001L,
2001L, 2001L, 2001L), Product.cat. = c("05", "052", "054", "05k",
"05", "053"), `Commodity Description` = c("Fruit and vegetables",
"Dried fruit including artificially dehydrated", "Vegetables, roots &
tubers, fresh or dried",
"Fruit and vegetables", "Fruit and vegetables", "Fruit,preserved and
fruit preparations"
), `Trade Value` = structure(c(7.61814641291993e-319,
7.4539189922423e-319,
1.64178014113046e-320, 7.61814641291993e-319, 6.99399328252869e-320,
3.16547859290487e-320), class = "integer64"), Prodcat1 = c("0",
"0", "0", "0", "0", "0"), Prodcat2 = c("05", "05", "05", "05",
"05", "05")), row.names = c(NA, -6L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = c("Reporter", "Partner",
"Prodcat2", "Year"), drop = TRUE, indices = list(0:3, 4:5),
group_sizes = c(4L,
2L), biggest_group_size = 4L, labels = structure(list(Reporter =
c("Algeria",
"Algeria"), Partner = c("United Arab Emirates", "United Kingdom"
), Prodcat2 = c("05", "05"), Year = c(2001L, 2001L)), row.names =
c(NA,
-2L), class = "data.frame", vars = c("Reporter", "Partner",
"Prodcat2",
"Year"), drop = TRUE))
As you can see, the code correctly identifies that Algeria's exports of 052 and 054 to the United Arab Emirates do not add up to its exports of 05 (the difference is only 1), and it correctly creates a 05k row. Yet the trade value of 05k is 154193 (equal to the trade value of the whole of 05) rather than 1. Do you know why this could be the case?
Answer 0 (score: 1)
Edit: OK, I think I've got it!
Data:
df <- data.frame( "Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA", "USA", "USA","USA"),
"Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU", "EU","EU","EU"),
"Product cat." = c("1", "11","111", "12","2", "21", "211", "212", "22", "3"),
"Val" = c(200, 170, 170, 30, 100, 50, 25, 5, 40, 220), stringsAsFactors = FALSE)
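We first create two helper variables, Prodcat1 and Prodcat2:
library(dplyr)    # group_by(), mutate(), filter(), arrange(), %>%
library(stringr)  # str_extract()
# create new variable Prodcat1 for the 1st level product category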
df1 <- df %>% group_by(Reporter, Partner) %>% mutate(Prodcat1 = str_extract(Product.cat., "^.{1}"))
# create new variable Prodcat2 for my 2nd level product category
df1 <- df1 %>% group_by(Reporter, Partner) %>% mutate(Prodcat2 = str_extract(Product.cat., "^.{2}"))
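For readers without stringr, the same prefixes can be sketched in base R. This is only an illustration: `code_prefix()` is a hypothetical helper, not part of the code above. It mimics `str_extract(x, "^.{n}")`, which returns NA when a code has fewer than n characters.

```r
# Hypothetical helper: first n characters of each code, or NA when the
# code is shorter than n (mirroring str_extract(x, "^.{n}")).
code_prefix <- function(x, n) {
  ifelse(nchar(x) >= n, substr(x, 1, n), NA_character_)
}

codes <- c("1", "11", "111", "12", "2", "21", "211", "212", "22", "3")
prodcat1 <- code_prefix(codes, 1)  # first-level prefix for every code
prodcat2 <- code_prefix(codes, 2)  # NA for the one-character codes
```

Note that the top-level codes ("1", "2", "3") get NA as their second-level prefix, so they end up together in one NA group when grouping by Prodcat2.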
Now we split the data into two parts: one that has to be completed, and one that needs no operation at the third level:
# to be completed
df2 <- df1 %>%
group_by(Reporter, Partner, Prodcat2) %>%
filter(sum(Val[2:n()]) < Val[1])
# no operation on third level
df3 <- df1 %>%
group_by(Reporter, Partner, Prodcat2) %>%
filter(!sum(Val[2:n()]) < Val[1] | n() == 1)
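The `| n() == 1` clause matters because of how `2:n()` behaves in a group with a single row. A small base R illustration (the vector names are made up for this example):

```r
# A group with a single row: 2:1 counts DOWN, so it indexes element 2
# (which does not exist, giving NA) and then element 1.
single <- c(30)
idx <- 2:length(single)          # c(2, 1), not an empty range
child_sum <- sum(single[idx])    # NA, because single[2] is NA

# A group with a parent and two children behaves as intended.
multi <- c(50, 25, 5)
multi_sum <- sum(multi[2:length(multi)])

# seq_len(n)[-1] is empty when n == 1, so the sum is 0 instead of NA.
safe_sum <- sum(single[seq_len(length(single))[-1]])
```

Because the comparison with NA is itself NA, single-row groups are dropped by the first filter, and `NA | TRUE` evaluates to TRUE, which is what keeps them in the second one.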
We split df2 by Prodcat2, controlling for Reporter and Partner:
split1 <- split(df2, interaction(df2$Reporter, df2$Partner, df2$Prodcat2))
split1 <- split1[sapply(split1, nrow) != 0]
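The second line is needed because `interaction()` creates a level for every combination of its inputs, including combinations that never occur in the data. A minimal sketch on a made-up toy data frame (`toy` and its values are assumptions for illustration), which also shows the `drop = TRUE` argument of `interaction()` as a possible alternative:

```r
toy <- data.frame(
  Reporter = c("USA", "USA", "EU"),
  Prodcat2 = c("21", "22", "21"),
  Val      = c(50, 40, 80),
  stringsAsFactors = FALSE
)

# interaction() builds every Reporter x Prodcat2 combination, including
# "EU.22", which has no rows in the data.
all_groups  <- split(toy, interaction(toy$Reporter, toy$Prodcat2))
lengths_all <- sapply(all_groups, nrow)

# Filtering on nrow() and using drop = TRUE keep the same groups.
kept_by_filter <- all_groups[lengths_all != 0]
kept_by_drop   <- split(toy, interaction(toy$Reporter, toy$Prodcat2, drop = TRUE))
```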
We then add new rows where necessary:
split2 <- lapply(split1, function(x){
y <- rbind.data.frame(x, x[1,])
y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat2"], "k")
y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
return(y)
})
Then we put the data back together for the first time and sort it by the original Product.cat.:
split3 <- do.call(rbind, split2)
newdf <- do.call(rbind, list(split3, df3))
newdf <- newdf %>%
arrange(Product.cat.)
The data so far:
# A tibble: 11 x 6
# Groups:   Reporter, Partner, Prodcat2 [5]
   Reporter Partner Product.cat.   Val Prodcat1 Prodcat2
   <chr>    <chr>   <chr>        <dbl> <chr>    <chr>
 1 USA      EU      1              200 1        NA
 2 USA      EU      11             170 1        11
 3 USA      EU      111            170 1        11
 4 USA      EU      12              30 1        12
 5 USA      EU      2              100 2        NA
 6 USA      EU      21              50 2        21
 7 USA      EU      211             25 2        21
 8 USA      EU      212              5 2        21
 9 USA      EU      21k             20 2        21
10 USA      EU      22              40 2        22
11 USA      EU      3              220 3        NA
Now we move on to the second level. First, we create three parts:
# part to complete
df4 <- newdf %>%
group_by(Reporter, Partner, Prodcat1) %>%
filter(nchar(Product.cat.) < 3) %>%
filter(n() == 1 | sum(Val[2:n()]) < Val[1])
# third level rows, which are not necessary here
df5 <- newdf %>%
group_by(Reporter, Partner, Prodcat1) %>%
filter(nchar(Product.cat.) == 3)
# second level part already complete
df6 <- newdf %>%
group_by(Reporter, Partner, Prodcat1) %>%
filter(nchar(Product.cat.) < 3) %>%
filter(sum(Val[2:n()]) == Val[1])
We split the data again, this time by Prodcat1, controlling for Reporter and Partner:
split3 <- split(df4, interaction(df4$Reporter, df4$Partner, df4$Prodcat1))
split3 <- split3[sapply(split3, nrow) != 0]
We create the new rows:
split4 <- lapply(split3, function(x){
if(nrow(x) == 1){
y <- rbind.data.frame(x, x)
y[2, "Product.cat."] <- paste0(y[2, "Prodcat1"], "m")
}else{
y <- rbind.data.frame(x, x[1,])
y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat1"], "k")
y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
}
return(y)
})
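The two branches can be tried out on small, stand-alone groups. The following is a base R sketch of the same logic, not a replacement for the function above: `complete_group()` is a hypothetical helper, and it pulls the columns out as plain vectors with `[[` before doing the arithmetic.

```r
# Hypothetical helper: x holds a parent row first, then any child rows.
complete_group <- function(x) {
  parent_code <- x[["Product.cat."]][1]
  new_row <- x[1, , drop = FALSE]          # copy the parent row as a template
  if (nrow(x) == 1) {
    # No children reported: the whole parent value goes to an "m" child.
    new_row[["Product.cat."]] <- paste0(parent_code, "m")
  } else {
    # Children reported but incomplete: the shortfall goes to a "k" child.
    new_row[["Product.cat."]] <- paste0(parent_code, "k")
    new_row[["Val"]] <- x[["Val"]][1] - sum(x[["Val"]][-1])
  }
  rbind(x, new_row)
}

no_children <- data.frame(Product.cat. = "3", Val = 220,
                          stringsAsFactors = FALSE)
some_children <- data.frame(Product.cat. = c("2", "21", "22"),
                            Val = c(100, 50, 40),
                            stringsAsFactors = FALSE)

m_out <- complete_group(no_children)    # adds a "3m" row
k_out <- complete_group(some_children)  # adds a "2k" row
```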
Then we glue everything back together, sort again, and remove the helper variables.
split5 <- do.call(rbind, split4)
finaldf <- do.call(rbind, list(split5, df5, df6))
finaldf <- finaldf %>%
ungroup() %>%
arrange(Product.cat.) %>%
select(-c("Prodcat1", "Prodcat2"))
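Since Product.cat. is a character column, the sort is alphabetical rather than numeric, which is what places the new "k" and "m" rows right after their numbered siblings. A small sketch of that ordering in base R, assuming a C-locale style comparison (which is what `method = "radix"` uses for character vectors); other locales may order digits and letters differently:

```r
codes <- c("3m", "2k", "22", "21k", "212", "211", "21", "2", "3",
           "12", "111", "11", "1")

# Character sort: digits come before lowercase letters, so "21k" follows
# "212" and "2k" follows "22".
sorted_codes <- sort(codes, method = "radix")
```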
Final data:
# A tibble: 13 x 4
   Reporter Partner Product.cat.   Val
   <chr>    <chr>   <chr>        <dbl>
 1 USA      EU      1              200
 2 USA      EU      11             170
 3 USA      EU      111            170
 4 USA      EU      12              30
 5 USA      EU      2              100
 6 USA      EU      21              50
 7 USA      EU      211             25
 8 USA      EU      212              5
 9 USA      EU      21k             20
10 USA      EU      22              40
11 USA      EU      2k              10
12 USA      EU      3              220
13 USA      EU      3m             220
Finally, we clear the environment of all the temporary objects that are no longer needed:
rm(df1, df2, df3, df4, df5, df6, newdf, split1, split2, split3, split4, split5)
This leaves the original dataset df and the final, complete dataset finaldf :)
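On the follow-up about 05k: the dput in the question shows that `Trade Value` has class integer64, and its values print as tiny doubles such as 7.6e-319, which is what 64-bit integers look like when their class is ignored. One possible explanation, which I have not verified on the real data, is that subsetting a tibble with `x[rows, "col"]` returns a data frame rather than a vector, and the integer64 class gets lost somewhere in the arithmetic on it. A minimal sketch of a more defensive version, assuming the values fit in a double: extract the column as a plain vector with `[[` and convert it with `as.numeric()` before subtracting. The helper name `residual_value()` and the toy child values are made up for illustration.

```r
# Hypothetical helper: parent value minus the sum of the child values,
# computed on a plain numeric vector rather than on a data-frame subset.
residual_value <- function(x, val_col) {
  v <- as.numeric(x[[val_col]])   # `[[` returns the column itself as a vector
  v[1] - sum(v[-1])
}

# Toy group: the parent value 154193 is the one quoted in the question;
# the child values are illustrative and chosen to leave a difference of 1.
grp <- data.frame(
  Product.cat. = c("05", "052", "054"),
  Val          = c(154193, 150869, 3323),
  stringsAsFactors = FALSE
)

res <- residual_value(grp, "Val")

# For comparison: single-bracket subsetting with drop = FALSE (the tibble
# behaviour) keeps a data frame, not a vector.
sub_df <- grp[2:nrow(grp), "Val", drop = FALSE]
```

With the real data this would be called as `residual_value(x, "Trade Value")` inside the `lapply()`. If the bit64 package is attached, `as.numeric()` on an integer64 column should convert the stored integers to ordinary doubles, which is exact for values below 2^53.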