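OK, this is a fairly long and involved question (at least for me). I have tried to make it as clear, organized, and detailed as possible, so please bear with me...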
----------------------------------------------------------------------
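I am currently far too manual when applying functions to subsets of my data, and I would like to figure out how to make the code more efficient. The problem is easiest to describe with an example:
The variables in my data (myData): GDP for 2017, 2018, and 2019 at four levels of granularity: continent, country, state (or province), and city. (Note: the GDP values are arbitrary; they are only there to keep the calculations simple.)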
myData:
|------|---------------|---------|------------|-------------|------|
| Year | Continent | Country | State | City | GDP |
|------|---------------|---------|------------|-------------|------|
| 2019 | North America | Canada | Alberta | Edmonton | 13 |
| 2018 | North America | Canada | Alberta | Calgary | 9 |
| 2018 | North America | Canada | Alberta | Edmonton | 3 |
| 2018 | Asia | India | Bihar | Patna | 19 |
| 2018 | Asia | India | Bihar | Gaya | 8 |
| 2017 | Asia | India | Bihar | Patna | 22 |
| 2019 | Asia | India | Bihar | Gaya | 19 |
| 2019 | Asia | India | Bihar | Patna | 16 |
| 2019 | North America | USA | California | San Diego | 23 |
| 2017 | North America | USA | California | Los Angeles | 18 |
| 2018 | North America | USA | California | Los Angeles | 25 |
| 2018 | North America | USA | Florida | Tampa | 14 |
| 2019 | North America | USA | Florida | Miami | 19 |
| 2018 | Asia | China | Guangdong | Shenzhen | 29 |
| 2017 | Asia | China | Guangdong | Shenzhen | 26 |
| 2019 | Asia | China | Guangdong | Shenzhen | 33 |
| 2019 | Asia | China | Guangdong | Guangzhou | 20 |
| 2018 | Asia | China | Guangdong | Guangzhou | 19 |
| 2018 | North America | Canada | Quebec | Montreal | 11 |
| 2019 | North America | Canada | Quebec | Montreal | 7 |
| 2019 | Asia | China | Shandong | Yantai | 30 |
| 2019 | Asia | China | Shandong | Jinan | 16 |
| 2018 | Asia | China | Shandong | Yantai | 17 |
| 2018 | Asia | China | Shandong | Jinan | 11 |
| 2019 | Asia | India | U.P. | Allahabad | 21 |
| 2018 | Asia | India | U.P. | Agra | 15 |
| 2018 | Asia | India | U.P. | Allahabad | 13 |
| 2019 | Asia | India | U.P. | Agra | 18 |
|------|---------------|---------|------------|-------------|------|
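The overall goal is to compute GDP quantiles (1 = 0-25%, 2 = 25%-50%, and so on) at different levels of granularity. This is exactly what I am looking for:
Quantiles for each city (with the data subset by continent, country, state/province, and city).
I currently have two steps in this process: build a subset for each level, then compute the quantiles for each subset (based on the GDP values).
The subsets come from summing GDP at each level. (Note: this step runs all the way down to level 5 and produces data frames with fewer and fewer rows.) This is what I did; it is quite manual and repetitive, so I would like to find a better way: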
Level_1.Year <- aggregate(
  GDP ~ Year + Continent + Country + State.Province + City,
  FUN = sum,
  data = myData)
Level_2.Continent <- aggregate(
  GDP ~ Continent + Country + State.Province + City,
  FUN = sum,
  data = myData)
Level_3.Country <- aggregate(
  GDP ~ Country + State.Province + City,
  FUN = sum,
  data = myData)
Level_4.State.Province <- aggregate(
  GDP ~ State.Province + City,
  FUN = sum,
  data = myData)
Level_5.City <- aggregate(
  GDP ~ City,
  FUN = sum,
  data = myData)
----------------------------------------------------------------------
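Now that we have the subsets, we compute the quantiles for each one. Because they all have different lengths and do not share the same variables, I have resorted to manual, repetitive calculations for each subset (again...):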
Level_1.Year_quantiles <- Level_1.Year %>%
  group_by(Year) %>%
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))
Level_2.Continent_quantiles <- Level_2.Continent %>%
  group_by(Continent) %>%
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))
Level_3.Country_quantiles <- Level_3.Country %>%
  group_by(Country) %>%
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))
.
.
.
# All the way through Level_5.City; I think you get the point.
----------------------------------------------------------------------
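Is there a way to (1) subset each level more efficiently, then (2) store each subset in a list of data frames, and then (3) add the quantiles to every data frame in that list?
If there is a better way to do this whole process, please let me know! I would also be glad to hear any other comments or suggestions.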
Answer 0 (score: 1)
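First, some clarification: what you are calling subsets are grouped summaries. See the help for aggregate for more information. Second, the answer to each of your three questions is yes. Third, your level 1 summary is equivalent to your data frame, because every combination of Year, Continent, Country, State.Province, and City occurs only once.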
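The third point can be checked on a small example. The sketch below uses a made-up data frame (`toy`, with hypothetical values and only two grouping columns) rather than the real `myData`; it only illustrates that summing within groups of size one changes nothing apart from the row order.

```r
# Toy data (hypothetical values): every Year/City combination appears once.
toy <- data.frame(
  Year = c(2018, 2019, 2018, 2019),
  City = c("Patna", "Patna", "Gaya", "Gaya"),
  GDP  = c(19, 16, 8, 19)
)

# "Level 1" summary: group by every identifying column.
level_1 <- aggregate(GDP ~ Year + City, data = toy, FUN = sum)

# Sort both data frames the same way so they can be compared row by row.
by_key <- function(df) df[order(df$Year, as.character(df$City)), ]

# Same number of rows and the same GDP values: nothing was aggregated.
nrow(level_1) == nrow(toy)
all(by_key(level_1)$GDP == by_key(toy)$GDP)
```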
Since you are using aggregate(), I will first show how to get a list of grouped summaries with aggregate():
library(tidyverse)
formula_list <-
  list(
    GDP ~ Year + Continent + Country + State.Province + City,
    GDP ~ Continent + Country + State.Province + City,
    GDP ~ Country + State.Province + City,
    GDP ~ State.Province + City,
    GDP ~ City
  )
summaries <- formula_list %>%
  map(~ aggregate(.x, FUN = sum, data = myData))
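If typing out the five formulas feels repetitive, they could also be generated from a vector of column names. The following is only a sketch using base R's reformulate(); the names `group_cols` and the `Level_*` labels are assumptions made for illustration, chosen to mirror the naming in the question.

```r
# Grouping columns ordered from coarsest to finest, as in the question.
group_cols <- c("Year", "Continent", "Country", "State.Province", "City")

# Level i keeps columns i..5, so each step drops the leading column.
formula_list <- lapply(seq_along(group_cols), function(i) {
  reformulate(group_cols[i:length(group_cols)], response = "GDP")
})

# Optional: name the list elements after the level they represent.
names(formula_list) <- paste0("Level_", seq_along(group_cols), ".", group_cols)

formula_list[[5]]
```

The resulting list can be passed to map() or lapply() in the same way as the hand-written one.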
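It is also possible to replace aggregate() with an approach based entirely on dplyr. The advantage is that it does away with the comparatively inefficient aggregate(). The downside is that we have to deal with quosures, which is a somewhat more advanced topic (see vignette("programming") for more information).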
var_combs <- list(
  vars(Year, Continent, Country, State.Province, City),
  vars(Continent, Country, State.Province, City),
  vars(Country, State.Province, City),
  vars(State.Province, City),
  vars(City))
summaries <- var_combs %>%
  map(~ myData %>%
        group_by(!!!.x) %>%
        summarize(GDP = sum(GDP)))
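As a possible alternative to quosures, the grouping columns can be kept as plain character vectors and selected with across(all_of(...)). This is a sketch that assumes dplyr 1.0.0 or later (for across() and the .groups argument); `toy` and `col_combs` are hypothetical names, and the data is a made-up three-column stand-in for myData.

```r
library(dplyr)

# Toy stand-in for myData (hypothetical values).
toy <- tibble(
  Country = c("India", "India", "India", "China"),
  City    = c("Patna", "Patna", "Gaya", "Jinan"),
  GDP     = c(19, 16, 8, 11)
)

# Grouping columns as character vectors instead of vars().
col_combs <- list(c("Country", "City"), "City")

# One grouped summary per set of columns; all_of() turns names into a selection.
summaries <- lapply(col_combs, function(cols) {
  toy %>%
    group_by(across(all_of(cols))) %>%
    summarize(GDP = sum(GDP), .groups = "drop")
})

summaries[[2]]
```

Dropping the groups at the end leaves ungrouped tibbles, so a later group_by() call starts from a clean slate.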
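The next step is to apply the code that computes the quartiles to each element of the list. Because you also want to change the grouping variable each time, we need to iterate over two lists in parallel, so we use purrr::map2():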
grp_var <- list(
  vars(Year),
  vars(Continent),
  vars(Country),
  vars(State.Province),
  vars(City)
)
map2(summaries[1:3],
     grp_var[1:3],
     ~ .x %>%
       group_by(!!!.y) %>%
       mutate(Quantile = cut(GDP,
                             breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                             labels = 1:4,
                             include.lowest = TRUE))
)
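You will notice that I had to subset both lists to their first three elements. The code you wrote to compute the quartiles fails if a group contains only one observation, because all the breaks passed to cut() are then identical (which makes sense: you cannot compute quartiles from a sample of one). For the last of the five summaries this is always the case, because by definition every group contains exactly one element. It is also questionable whether the result is particularly meaningful when a group has only two or three observations.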
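One way around the failure, sketched here under the assumption that an NA label is acceptable for groups that are too small, is to wrap the quartile code in a helper that checks the breaks first. `safe_quartile` is a hypothetical name, not part of any package.

```r
# Hypothetical helper: quartile labels 1-4, or NA for the whole group when the
# quartile breaks are not unique (for example, a single observation).
safe_quartile <- function(x) {
  brks <- quantile(x, c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  if (anyDuplicated(brks) > 0) {
    # Keep the same factor levels so results from different groups combine.
    return(factor(rep(NA, length(x)), levels = 1:4))
  }
  cut(x, breaks = brks, labels = 1:4, include.lowest = TRUE)
}

safe_quartile(c(9, 16, 18, 27, 39))  # distinct breaks: labels are assigned
safe_quartile(42)                    # single value: NA instead of an error
```

Using mutate(Quantile = safe_quartile(GDP)) inside the map2() call should then allow all five summaries to be processed, with NA marking the groups where quartiles are undefined.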
Data:
myData <- structure(list(
Year = c(2019, 2019, 2018, 2019, 2019, 2018, 2019,
2018, 2018, 2018, 2018, 2018, 2018, 2017, 2017, 2019, 2018, 2019,
2019, 2018, 2019, 2017, 2019, 2018, 2018, 2018, 2019, 2019),
Continent = c("North America", "Asia", "Asia", "North America",
"Asia", "North America", "Asia", "North America", "Asia",
"North America", "Asia", "Asia", "Asia", "North America",
"Asia", "North America", "Asia", "North America", "Asia",
"North America", "Asia", "Asia", "Asia", "North America",
"Asia", "Asia", "Asia", "Asia"),
Country = c("Canada", "India", "India", "USA", "China", "USA", "China",
"Canada", "China", "Canada", "India", "India", "China",
"USA", "China", "USA", "India", "Canada", "China", "USA",
"China", "India", "India", "Canada", "China", "China",
"India", "India"),
State.Province = c("Alberta", "Uttar Pradesh", "Bihar", "California",
"Shandong", "Florida", "Shandong", "Quebec", "Guangdong",
"Alberta", "Uttar Pradesh", "Bihar", "Shandong",
"California", "Guangdong", "Florida", "Uttar Pradesh",
"Quebec", "Guangdong", "California", "Guangdong", "Bihar",
"Bihar", "Alberta", "Shandong", "Guangdong", "Uttar Pradesh",
"Bihar"),
City = c("Edmonton", "Allahabad", "Patna", "Los Angeles", "Yantai", "Miami",
"Jinan", "Montreal", "Shenzhen", "Calgary", "Agra", "Gaya", "Yantai",
"Los Angeles", "Shenzhen", "Miami", "Allahabad", "Montreal",
"Shenzhen", "Los Angeles", "Guangzhou", "Patna", "Gaya", "Edmonton",
"Jinan", "Guangzhou", "Agra", "Patna"),
GDP = c(13, 21, 19, 23, 30, 14, 16, 11, 29, 9, 15, 8, 17, 18, 26, 19, 13, 7,
33, 25, 20, 22, 19, 3, 11, 19, 18, 16)),
class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"),
row.names = c(NA, -28L),
spec = structure(list(cols = list(Year = structure(list(), class = c("collector_double", "collector")),
Continent = structure(list(), class = c("collector_character", "collector")),
Country = structure(list(), class = c("collector_character", "collector")),
State.Province = structure(list(), class = c("collector_character", "collector")),
City = structure(list(), class = c("collector_character", "collector")),
GDP = structure(list(), class = c("collector_double", "collector"))),
default = structure(list(), class = c("collector_guess", "collector")),
skip = 2),
class = "col_spec"))
Answer 1 (score: 0)
Consider an apply-family solution, namely lapply, by (a wrapper around tapply), and Map (a wrapper around mapply), so that all of the processing is done on lists:
agg_factors <- c("City", "State", "Country", "Continent", "Year")
# NAMED LIST OF DATA FRAMES WHERE FORMULA DYNAMICALLY BUILT AND PASS INTO aggregate()
agg_df_list <- setNames(lapply(seq_along(agg_factors), function(i) {
  agg_formula <- as.formula(paste("GDP ~", paste(agg_factors[1:i], collapse=" + ")))
  aggregate(agg_formula, myData, FUN=sum)
}), agg_factors)
# FUNCTION TO CALL by() TO RUN FUNCTION ON EACH SUBSET TO BIND TOGETHER AT END
proc_quantiles <- function(df, nm) {
  dfs <- by(df, df[[nm]], function(sub)
    transform(sub,
              Quantile = tryCatch(cut(GDP,
                                      breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                                      labels = 1:4,
                                      include.lowest = TRUE),
                                  error = function(e) NA)
    )
  )
  do.call(rbind, unname(dfs))
}
# ELEMENTWISE LOOP THROUGH DFs AND CORRESPONDING NAMES
quantile_df_list <- Map(proc_quantiles, agg_df_list, names(agg_df_list))
Output
head(quantile_df_list$City)
# City GDP Quantile
# 1 Agra 33 NA
# 2 Allahabad 34 NA
# 3 Calgary 9 NA
# 4 Edmonton 16 NA
# 5 Gaya 27 NA
# 6 Guangzhou 39 NA
head(quantile_df_list$State)
# City State GDP Quantile
# 1 Calgary Alberta 9 1
# 2 Edmonton Alberta 16 4
# 3 Gaya Bihar 27 1
# 4 Patna Bihar 57 4
# 5 Los Angeles California 43 4
# 6 San Diego California 23 1
head(quantile_df_list$Country)
# City State Country GDP Quantile
# 1 Calgary Alberta Canada 9 1
# 2 Edmonton Alberta Canada 16 2
# 3 Montreal Quebec Canada 18 4
# 4 Guangzhou Guangdong China 39 2
# 5 Shenzhen Guangdong China 88 4
# 6 Jinan Shandong China 27 1
head(quantile_df_list$Continent)
# City State Country Continent GDP Quantile
# 1 Guangzhou Guangdong China Asia 39 3
# 2 Shenzhen Guangdong China Asia 88 4
# 3 Jinan Shandong China Asia 27 1
# 4 Yantai Shandong China Asia 47 3
# 5 Gaya Bihar India Asia 27 1
# 6 Patna Bihar India Asia 57 4
head(quantile_df_list$Year)
# City State Country Continent Year GDP Quantile
# 1 Shenzhen Guangdong China Asia 2017 26 4
# 2 Patna Bihar India Asia 2017 22 2
# 3 Los Angeles California USA North America 2017 18 1
# 4 Guangzhou Guangdong China Asia 2018 19 3
# 5 Shenzhen Guangdong China Asia 2018 29 4
# 6 Jinan Shandong China Asia 2018 11 1