I have a dataset that looks like this:
set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 2)
year <- rep(c(1998,1998,1998,1998,1998,1998,1998,1998,1998,1998,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000), 2)
value <- sample(1:10000, size=length(origin), replace=TRUE)
test.df <- as.data.frame(cbind(origin, year, value))
rm(origin, year, value)
I then have two lists.
The first is a list of countries grouped by region, built with the ISOcodes library, as follows:
library("ISOcodes")
list.continent <- list(asia = c("Central Asia", "Eastern Asia", "South-eastern Asia", "Southern Asia", "Western Asia"),
africa = c("Northern Africa", "Sub-Saharan Africa", "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa"),
europe = c("Eastern Europe", "Northern Europe", "Channel Islands", "Southern Europe", "Western Europe"),
oceania = c("Australia and New Zealand", "Melanesia", "Micronesia", "Polynesia"),
northamerica = c("Northern America"),
latinamerica = c("South America", "Central America", "Caribbean"))
country.list.continent <- sapply(list.continent, function(item) {
region <- subset(UN_M.49_Regions, Name %in% item)
sub <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
return(sub$ISO_Alpha_3)
}, simplify = FALSE)
rm(list.continent)
And a second list containing the years:
year.list <- levels(as.factor(unique(test.df$year)))
I want to fill a matrix with the figures computed for a given region in a given year. The matrix is set up as follows:
ncol <- length(year.list)
nrow <- length(country.list.continent)
matrix.extraction <- matrix(, nrow = nrow, ncol = ncol)
rownames(matrix.extraction) <- names(country.list.continent)
colnames(matrix.extraction) <- year.list
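For the computation, I have a loop that subsets the dataset, which is too large to process otherwise. The loop runs over the years (equivalent to colnames(matrix.extraction)). The idea is to compute, for each year, the share of the total value that each country represents (as a percentage). The computation itself is simple and works well. My problem comes when I need to assign the resulting values to each row.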
for(i in 1:length(colnames(matrix.extraction))){
### I subset and compute what I want
table.temp <- test.df %>%
subset(year == colnames(matrix.extraction)[i]) %>%
group_by(origin) %>%
summarise(value = sum(value, na.rm = TRUE))
table.temp$percent <- prop.table(table.temp$value)
### then I need to attribute the wanted values
matrix.extraction["ROWNAME",i] <- table.temp %>%
subset(origin %in% country.list.continent$"ROWNAME") %>%
summarise(. ,sum = sum(percent)))
}
I really don't know how to do this.
The expected result is a matrix like:
1998 2000
asia here NA
africa NA NA
europe NA NA
oceania NA NA
northamerica NA NA
latinamerica NA NA
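where, instead of "here" in [1,1], the cell holds the sum over every country belonging to the region named by the row, for the year named by the column.
Any help would be much appreciated.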
Answer 0 (score: 1)
Using a double sapply, we can loop over year.list and country.list.continent and compute the sum of value for each combination.
sapply(year.list, function(x) sapply(names(country.list.continent), function(y) {
with(test.df, sum(value[origin %in% country.list.continent[[y]] & year == x]))
}))
# 1998 2000
#asia 21759 20059
#africa 0 0
#europe 39700 35981
#oceania 0 0
#northamerica 21347 17324
#latinamerica 10847 8672
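The question asked for a preallocated matrix filled row by row, and for shares rather than raw totals. The following is only a minimal sketch of that idea, not the answerer's method: `regions`, `toy.df`, `totals`, and `shares` are hypothetical names, the data are made up so the example runs without ISOcodes, and the shares are assumed to be relative to the countries present in the data.

```r
# Hypothetical region lookup and toy data (not the data from the question).
regions <- list(europe = c("DEU", "GBR"), asia = c("JPN", "KOR"))
toy.df <- data.frame(
  origin = c("DEU", "GBR", "JPN", "DEU", "KOR"),
  year   = c(1998, 1998, 1998, 2000, 2000),
  value  = c(10, 20, 30, 40, 50)
)
years <- as.character(sort(unique(toy.df$year)))

# Preallocate, then index by row and column names instead of positions.
totals <- matrix(0, nrow = length(regions), ncol = length(years),
                 dimnames = list(names(regions), years))
for (yr in years) {
  for (reg in names(regions)) {
    keep <- toy.df$year == yr & toy.df$origin %in% regions[[reg]]
    totals[reg, yr] <- sum(toy.df$value[keep])
  }
}

# margin = 2 divides each cell by its column (year) total,
# so every column of `shares` sums to 1.
shares <- prop.table(totals, margin = 2)
round(100 * shares, 1)
```

Indexing with `totals[reg, yr]` is what replaces the "ROWNAME" placeholder in the question's loop: the row is addressed by the region name, so no positional bookkeeping is needed.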
If we are interested in a tidyverse solution:
library(tidyverse)
crossing(x = year.list, y = names(country.list.continent)) %>%
mutate(sum = map2_dbl(x, y, ~
test.df %>%
filter(year == .x & origin %in% country.list.continent[[.y]]) %>%
summarise(total = sum(value)) %>%
pull(total)))
# x y sum
# <chr> <chr> <dbl>
# 1 1998 africa 0
# 2 1998 asia 21759
# 3 1998 europe 39700
# 4 1998 latinamerica 10847
# 5 1998 northamerica 21347
# 6 1998 oceania 0
# 7 2000 africa 0
# 8 2000 asia 20059
# 9 2000 europe 35981
#10 2000 latinamerica 8672
#11 2000 northamerica 17324
#12 2000 oceania 0
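You have the numbers stored as factors in test.df, so we need to convert them to actual numbers. Run the following command before applying the methods above.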
test.df[-1] <- lapply(test.df[-1], function(x) as.numeric(as.character(x)))
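The conversion goes through as.character() because cbind() on mixed vectors produces a character matrix, which as.data.frame() turns into factor columns in R versions before 4.0.0 (and into character columns from 4.0.0 on, where the same line still works). A small sketch of why the intermediate step matters, using a hypothetical toy factor and data frame rather than the question's data:

```r
# A factor stores integer codes plus labels; as.numeric() alone returns the codes.
f <- factor(c("1998", "2000", "1998"))
codes  <- as.numeric(f)                # internal codes, not the years
values <- as.numeric(as.character(f))  # the years themselves

# Building the data frame directly (instead of via cbind) keeps
# numeric columns numeric. Hypothetical toy values.
toy.df <- data.frame(
  origin = c("DEU", "USA", "JPN"),
  year   = c(1998, 1998, 2000),
  value  = c(10, 20, 30)
)
str(toy.df)
```

Constructing test.df with data.frame(origin, year, value) in the first place would probably avoid the conversion step altogether.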
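Answer 1 (score: 1)
We can do this in the tidyverse. Convert the named list to a two-column dataset (with enframe or stack), then full_join it with 'test.df' after filtering 'test.df' to the 'year' values contained in 'year.list'. Group by 'name' and 'year', get the sum of 'value', and spread the result into 'wide' format.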
library(tidyverse)
enframe(country.list.continent, value = "origin") %>%
unnest %>%
full_join(test.df %>%
filter(year %in% year.list)) %>%
group_by(name, year) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
spread(year, value, fill = 0) %>%
select(-4)
# A tibble: 6 x 3
# Groups: name [6]
# name `1998` `2000`
# <chr> <dbl> <dbl>
#1 africa 0 0
#2 asia 33038 18485
#3 europe 36658 35874
#4 latinamerica 14323 14808
#5 northamerica 15697 27405
#6 oceania 0 0
Or, in base R, we can stack the list into a two-column data.frame, merge it with 'test.df' after a subset, and build the table with xtabs.
xtabs(value ~ ind + year, merge(stack(country.list.continent),
subset(test.df, year %in% year.list), by.x = "values", by.y = "origin"))
# year
#ind 1998 2000
# asia 33038 18485
# africa 0 0
# europe 36658 35874
# oceania 0 0
# northamerica 15697 27405
# latinamerica 14323 14808
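To make the intermediate steps of the base R approach visible, here is a minimal sketch on made-up data. The names `regions`, `toy.df`, `lookup`, `joined`, and `result` are hypothetical, and the hand-written region lookup stands in for country.list.continent so the example runs without ISOcodes.

```r
# Hypothetical region lookup and toy data (not the data from the question).
regions <- list(europe = c("DEU", "GBR"), asia = c("JPN", "KOR"))
toy.df <- data.frame(
  origin = c("DEU", "GBR", "JPN", "DEU", "KOR"),
  year   = c(1998, 1998, 1998, 2000, 2000),
  value  = c(10, 20, 30, 40, 50)
)

# stack() turns the named list into two columns:
# `values` (country code) and `ind` (region name).
lookup <- stack(regions)

# merge() attaches the region to every data row;
# xtabs() then sums `value` for each region/year pair.
joined <- merge(lookup, toy.df, by.x = "values", by.y = "origin")
result <- xtabs(value ~ ind + year, data = joined)
result
```

Because `ind` is a factor whose levels follow the order of the list names, regions keep the list's order in the table, and regions with no matching rows still appear with a zero count.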