按粒度级别设置数据子集,并将函数应用于R

时间:2019-10-05 18:01:28

标签: r dataframe dplyr subset

好的,这个问题是一个相当漫长而复杂的问题(至少对我而言),我已尽力使这个问题尽可能清晰,有条理和详细,所以请耐心等待...

----------------------------------------------------------------------

我目前在将函数应用于数据的子集时过分手动,我想弄清楚如何使代码更有效。用示例描述问题最简单:

我的数据(myData)中的变量:2017年,2018年和2019年的GDP处于四个粒度级别:大陆,国家/地区,州(或省)和城市。 (注意:GDP数值是任意的;仅用于简化计算)

myData:

   |------|---------------|---------|------------|-------------|------|
   | Year | Continent     | Country | State      | City        | GDP  |
   |------|---------------|---------|------------|-------------|------|
   | 2019 | North America | Canada  | Alberta    | Edmonton    | 13   |
   | 2018 | North America | Canada  | Alberta    | Calgary     | 9    |
   | 2018 | North America | Canada  | Alberta    | Edmonton    | 3    | 
   | 2018 | Asia          | India   | Bihar      | Patna       | 19   |
   | 2018 | Asia          | India   | Bihar      | Gaya        | 8    |
   | 2017 | Asia          | India   | Bihar      | Patna       | 22   | 
   | 2019 | Asia          | India   | Bihar      | Gaya        | 19   |
   | 2019 | Asia          | India   | Bihar      | Patna       | 16   |
   | 2019 | North America | USA     | California | San Diego   | 23   |
   | 2017 | North America | USA     | California | Los Angeles | 18   |
   | 2018 | North America | USA     | California | Los Angeles | 25   |
   | 2018 | North America | USA     | Florida    | Tampa       | 14   |
   | 2019 | North America | USA     | Florida    | Miami       | 19   |
   | 2018 | Asia          | China   | Guangdong  | Shenzhen    | 29   |
   | 2017 | Asia          | China   | Guangdong  | Shenzhen    | 26   |
   | 2019 | Asia          | China   | Guangdong  | Shenzhen    | 33   |
   | 2019 | Asia          | China   | Guangdong  | Guangzhou   | 20   |
   | 2018 | Asia          | China   | Guangdong  | Guangzhou   | 19   |
   | 2018 | North America | Canada  | Quebec     | Montreal    | 11   |
   | 2019 | North America | Canada  | Quebec     | Montreal    | 7    |
   | 2019 | Asia          | China   | Shandong   | Yantai      | 30   |
   | 2019 | Asia          | China   | Shandong   | Jinan       | 16   |
   | 2018 | Asia          | China   | Shandong   | Yantai      | 17   |
   | 2018 | Asia          | China   | Shandong   | Jinan       | 11   |
   | 2019 | Asia          | India   | U.P.       | Allahabad   | 21   |
   | 2018 | Asia          | India   | U.P.       | Agra        | 15   |
   | 2018 | Asia          | India   | U.P.       | Allahabad   | 13   |
   | 2019 | Asia          | India   | U.P.       | Agra        | 18   |
   |------|---------------|---------|------------|-------------|------|

总体目标是在不同的粒度级别上计算GDP分位数(1 = 0-25%,2 = 25%-50%,等等)。这正是我要寻找的:

  • 每年的数量; (将整个数据集细分为3年)
  • 每个大陆的分位数; (按大陆划分子数据)
  • 每个国家/地区的数量; (按大陆和国家划分的子集数据)
  • 每个州的分位数。 (按大陆,国家和省的子集数据)。
  • 每个城市的分位数; (按大陆,国家/地区,州/省和城市划分的子集数据)


在此过程中,我目前有两个步骤:

  1. 每个级别的子集数据。
  2. 计算每个子集的分位数(基于GDP值)。


我们通过在每个级别上相加/相加GDP来进行子集划分。 (注意:此步骤将向下移动到第5级,将生成行数越来越少的数据帧。)这是我所做的,而且相当手动且重复,因此我想找到一种更好的方法:

Level_1.Year <- aggregate(
    GDP ~ 
      Year + 
      Continent + 
      Country + 
      State.Province + 
      City, 
    FUN = sum, 
    data = myData)

Level_2.Continent <- aggregate(
    GDP ~ 
      Continent + 
      Country + 
      State.Province + 
      City, 
    FUN = sum, 
    data = myData)

Level_3.Country <- aggregate(
    GDP ~ 
      Country + 
      State.Province + 
      City, 
    FUN = sum, 
    data = myData)

Level_4.State.Province <- aggregate(
    GDP ~ 
      State.Province + 
      City, 
    FUN = sum, 
    data = myData)

Level_5.City <- aggregate(
    GDP ~ 
      City, 
    FUN = sum, 
    data = myData)

----------------------------------------------------------------------

现在我们有了子集,我们为每个子集计算分位数。由于它们都是不同的长度,并且没有相同的变量,因此我对每个子集求助于手动/重复计算(再次...):

Level_1.Year_quantiles <- Level_1.Year %>% 
        group_by(Year) %>% 
        mutate(Quantile = cut(GDP,
            breaks = quantile(GDP, 
        c(0, 0.25, 0.5, 0.75, 1)), 
            labels = 1:4, 
            include.lowest = TRUE))

Level_2.Continent_quantiles <- Level_2.Continent %>% 
        group_by(Continent) %>% 
        mutate(Quantile = cut(GDP,
            breaks = quantile(GDP, 
        c(0, 0.25, 0.5, 0.75, 1)), 
            labels = 1:4, 
            include.lowest = TRUE))

Level_3.Country_quantiles <- Level_3.Country %>% 
        group_by(Country) %>% 
        mutate(Quantile = cut(GDP,
            breaks = quantile(GDP, 
        c(0, 0.25, 0.5, 0.75, 1)), 
            labels = 1:4, 
            include.lowest = TRUE))
        . 
        .
        .

# All the way through Level_5.City; I think you get the point. 

----------------------------------------------------------------------

是否有办法(1)以更有效的方式对每个级别进行子集处理,然后(2)将每个子集存储在数据帧列表中,然后(3)将分位数添加到列表中的每个数据帧?

如果有更好的方法来完成整个过程,请告诉我!另外,如果您有任何意见或建议,我想听听他们的意见。

2 个答案:

答案 0 :(得分:1)

首先,进行一些澄清:您要调用的子集是分组的摘要。有关更多信息,请参阅“聚合”。第二,对三个问题中的每一个的回答都是肯定的。第三,您的1级摘要相当于您的数据框。

在使用aggregate()时,我将首先说明如何使用aggregate()获取分组汇总的列表:

library(tidyverse)

formula_list <- 
  list(
    GDP ~ Year + Continent + Country + State.Province + City,
    GDP ~ Continent + Country + State.Province + City, 
    GDP ~ Country + State.Province + City, 
    GDP ~ State.Province + City,
    GDP ~ City
    )

summaries <- formula_list %>% 
  map( ~ aggregate(.x, FUN = sum, data = myData))

也可以用完全基于aggregate()的方法替换dplyr。这样做的好处是替换了效率低下的aggregate()。不利之处是我们将不得不处理担保问题,这是一个更为高级的主题(有关更多信息,请咨询vignette("programming"))。

var_combs <- list(
  vars(Year, Continent, Country, State.Province, City),
  vars(Continent, Country, State.Province, City),
  vars(Country, State.Province, City),
  vars(State.Province, City),
  vars(City)) 

summaries <- var_combs %>% 
  map(~ myData %>% 
          group_by(!!!.x) %>% 
          summarize(GDP = sum(GDP)))

接下来是将用于计算四分位数的代码应用于列表的每个元素。当您还要更改分组变量时,我们需要遍历两个列表,因此我们将使用purrr::map2()

grp_var <- list(
  vars(Year),
  vars(Continent),
  vars(Country),
  vars(State.Province),
  vars(City)
)

map2(summaries[1:3], 
     grp_var[1:3], 
     ~ .x %>%  
       group_by(!!!.y) %>% 
       mutate(Quantile = cut(GDP,
                             breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)), 
                             labels = 1:4, 
                             include.lowest = TRUE))
)

您会注意到,我不得不将列表仅子集为前三个元素。如果一组中只有一个观察值,则您编写的用于计算四分位数的代码将失败(这很有意义:您无法为一个样本的四分位数计算四分位数)。对于五个中的最后一个,情况总是如此,因为根据定义,每个组仅包含一个元素。如果每个组只有两个或三个观察值,那么结果是否特别有意义也值得怀疑。

数据:

myData <- structure(list(
    Year = c(2019, 2019, 2018, 2019, 2019, 2018, 2019, 
            2018, 2018, 2018, 2018, 2018, 2018, 2017, 2017, 2019, 2018, 2019, 
            2019, 2018, 2019, 2017, 2019, 2018, 2018, 2018, 2019, 2019), 
    Continent = c("North America", "Asia", "Asia", "North America", 
                 "Asia", "North America", "Asia", "North America", "Asia", 
                 "North America", "Asia", "Asia", "Asia", "North America", 
                 "Asia", "North America", "Asia", "North America", "Asia", 
                 "North America", "Asia", "Asia", "Asia", "North America", 
                 "Asia", "Asia", "Asia", "Asia"), 
    Country = c("Canada", "India", "India", "USA", "China", "USA", "China", 
               "Canada", "China", "Canada", "India", "India", "China", 
               "USA", "China", "USA", "India", "Canada", "China", "USA", 
               "China", "India", "India", "Canada", "China", "China", 
               "India", "India"), 
    State.Province = c("Alberta", "Uttar Pradesh", "Bihar", "California", 
                      "Shandong", "Florida", "Shandong", "Quebec", "Guangdong", 
                      "Alberta", "Uttar Pradesh", "Bihar", "Shandong", 
                      "California", "Guangdong", "Florida", "Uttar Pradesh", 
                      "Quebec", "Guangdong", "California", "Guangdong", "Bihar", 
                      "Bihar", "Alberta", "Shandong", "Guangdong", "Uttar Pradesh", 
                      "Bihar"), 
    City = c("Edmonton", "Allahabad", "Patna", "Los Angeles", "Yantai", "Miami", 
             "Jinan", "Montreal", "Shenzhen", "Calgary", "Agra", "Gaya", "Yantai", 
             "Los Angeles", "Shenzhen", "Miami", "Allahabad", "Montreal", 
             "Shenzhen", "Los Angeles", "Guangzhou", "Patna", "Gaya", "Edmonton", 
             "Jinan", "Guangzhou", "Agra", "Patna"), 
    GDP = c(13, 21, 19, 23, 30, 14, 16, 11, 29, 9, 15, 8, 17, 18, 26, 19, 13, 7, 
            33, 25, 20, 22, 19, 3, 11, 19, 18, 16)), 
  class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -28L), 
  spec = structure(list(cols = list(Year = structure(list(), class = c("collector_double", "collector")), 
                                    Continent = structure(list(), class = c("collector_character", "collector")), 
                                    Country = structure(list(), class = c("collector_character", "collector")), 
                                    State.Province = structure(list(), class = c("collector_character", "collector")), 
                                    City = structure(list(), class = c("collector_character", "collector")), 
                                    GDP = structure(list(), class = c("collector_double", "collector"))), 
                        default = structure(list(), class = c("collector_guess", "collector")), 
                        skip = 2), 
                   class = "col_spec"))

答案 1 :(得分:0)

考虑一个应用族解决方案,即lapplyby(包装为tapply的包装和Map(包装为mapply的包装)来处理所有处理列表:

agg_factors <- c("City", "State", "Country", "Continent", "Year")

# NAMED LIST OF DATA FRAMES WHERE FORMULA DYNAMICALLY BUILT AND PASS INTO aggregate()
agg_df_list <- setNames(lapply(seq_along(agg_factors), function(i) {
                              agg_formula <- as.formula(paste("GDP ~", paste(agg_factors[1:i], collapse=" + ")))
                              aggregate(agg_formula, myData, FUN=sum)
                       }), agg_factors)

# FUNCTION TO CALL by() TO RUN FUNCTION ON EACH SUBSET TO BIND TOGETHER AT END
proc_quantiles <- function(df, nm) {

  dfs <- by(df, df[[nm]], function(sub) 
               transform(sub,
                         Quantile = tryCatch(cut(GDP,
                                                 breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)), 
                                                 labels = 1:4, 
                                                 include.lowest = TRUE), 
                                             error = function(e) NA)
                         )
         ) 

  do.call(rbind, unname(dfs))      
 }

# ELEMENTWISE LOOP THROUGH DFs AND CORRESPONDING NAMES
quantile_df_list <- Map(proc_quantiles, agg_df_list, names(agg_df_list))

输出

head(quantile_df_list$City)
#        City GDP Quantile
# 1      Agra  33       NA
# 2 Allahabad  34       NA
# 3   Calgary   9       NA
# 4  Edmonton  16       NA
# 5      Gaya  27       NA
# 6 Guangzhou  39       NA

head(quantile_df_list$State)
#          City      State GDP Quantile
# 1     Calgary    Alberta   9        1
# 2    Edmonton    Alberta  16        4
# 3        Gaya      Bihar  27        1
# 4       Patna      Bihar  57        4
# 5 Los Angeles California  43        4
# 6   San Diego California  23        1

head(quantile_df_list$Country)
#        City     State Country GDP Quantile
# 1   Calgary   Alberta  Canada   9        1
# 2  Edmonton   Alberta  Canada  16        2
# 3  Montreal    Quebec  Canada  18        4
# 4 Guangzhou Guangdong   China  39        2
# 5  Shenzhen Guangdong   China  88        4
# 6     Jinan  Shandong   China  27        1

head(quantile_df_list$Continent)
#        City     State Country Continent GDP Quantile
# 1 Guangzhou Guangdong   China      Asia  39        3
# 2  Shenzhen Guangdong   China      Asia  88        4
# 3     Jinan  Shandong   China      Asia  27        1
# 4    Yantai  Shandong   China      Asia  47        3
# 5      Gaya     Bihar   India      Asia  27        1
# 6     Patna     Bihar   India      Asia  57        4

head(quantile_df_list$Year)
#          City      State Country     Continent Year GDP Quantile
# 1    Shenzhen  Guangdong   China          Asia 2017  26        4
# 2       Patna      Bihar   India          Asia 2017  22        2
# 3 Los Angeles California     USA North America 2017  18        1
# 4   Guangzhou  Guangdong   China          Asia 2018  19        3
# 5    Shenzhen  Guangdong   China          Asia 2018  29        4
# 6       Jinan   Shandong   China          Asia 2018  11        1