Populating a new data frame with summaries from a large dataset of counts and years

Asked: 2019-04-08 21:18:10

Tags: r dataframe group-by

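I am trying to teach myself how to manipulate and summarize large datasets. I would like to create a new data frame and populate it with summary data about the years associated with count observations, grouped by location. I created an example data frame below.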

example.frame <- data.frame(
                   "Obs.ID" = 1:50, 
                   "Species" = rep("T. rex", 50),
                   "Site" = c(rep("Big Red", 24), rep("Supermax", 26)),
                   "Site.ID" = c(rep("1578", 24), rep("0185", 26)),
                   "Year" = c(1999, 1999, 1999, 2000, 2001, 2002, 2002,
                              2003, 2003, 2003, 2003, 2003, 2003, 2004,
                              2004, 2004, 2004, 2004, 2005, 2005, 2005, 
                              2006, 2006, 2007, 1978, 1978, 1978, 1978,
                              1979, 1979, 1999, 1999, 2000, 2000, 2000,
                              2000, 2000, 2001, 2001, 2001, 2002, 2003,
                              2003, 2003, 2003, 2004, 2005, 2006, 2006,
                              2006),
                   "Count" = c(0, 1, 5, 0, 3, 1, 1, 0, 1, 3, 2, 1, 1, 0,
                               0, 1, 2, 3, 1, 1, 5, 0, 1, 2, 8, 11, 7,
                               2, 3, 1, 1, 0, 2, 5, 6, 0, 1, 2, 1, 1, 0,
                               0, 2, 3, 1, 2, 0, 1, 2, 1),
                   stringsAsFactors = FALSE)

I would like to create a new data frame that summarizes this example data and includes the columns defined below.

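Line.ID – Unique sequential identifier for each row of the data frame, starting at 1.
Site – Site name from example.frame.
Total.years – Total number of unique years associated with the site.
Years.3 – Number of years with at least three counts (including zeros) associated with the site.
Years.4 – Number of years with at least four counts (including zeros) associated with the site.
Years.5 – Number of years with at least five counts (including zeros) associated with the site.
Total.pos – Number of years in which the site has at least one count greater than zero.
Pos.3 – Number of years in which the site has at least three counts greater than zero.
Pos.4 – Number of years in which the site has at least four counts greater than zero.
Pos.5 – Number of years in which the site has at least five counts greater than zero.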
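
To make these definitions concrete, here is a minimal base R sketch for a single site. The vectors year and count are made-up toy values (not the example data above), and the variable names are only illustrative.

```r
# Toy observations for one site (hypothetical values)
year  <- c(1999, 1999, 1999, 2000, 2001, 2001, 2001, 2001)
count <- c(0,    1,    5,    0,    3,    1,    0,    2)

# Number of observations per year, zeros included
obs_per_year <- table(year)             # 1999: 3, 2000: 1, 2001: 4
# Number of observations per year with a count above zero
pos_per_year <- table(year[count > 0])  # 1999: 2, 2001: 3

total_years <- length(obs_per_year)     # 3 distinct years
years_3     <- sum(obs_per_year >= 3)   # 2 (1999 and 2001)
years_4     <- sum(obs_per_year >= 4)   # 1 (2001)
total_pos   <- length(pos_per_year)     # 2 (2000 has no positive count)
pos_3       <- sum(pos_per_year >= 3)   # 1 (2001)
```

The same pattern extends to the thresholds 4 and 5 by changing the comparison value.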

The new data frame should look like this:

new.frame <- data.frame(
               "Line.ID" = c(1, 2),
               "Site" = c("Big Red", "Supermax"),
               "Total.years" = c(9, 10),
               "Years.3" = c(4, 5),
               "Years.4" = c(2, 3),
               "Years.5" = c(2, 1),
               "Total.pos" = c(8, 8),
               "Pos.3" = c(3, 5),
               "Pos.4" = c(1, 2),
               "Pos.5" = c(1, 0),
               stringsAsFactors = FALSE)

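I think the right approach is some combination of "summarise" and "group_by" in dplyr, but I can't figure out how to put them together. I couldn't find an existing answer that fits this situation, so I thought it would be helpful to post.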

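Follow-up question:
How can I incorporate an additional layer when creating the summary table (for example, adding another species that occurs at the same sites)? An example data frame is below.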

example.frame.2 <- data.frame(
                     "Obs.ID" = 1:80,
                     "Species" = c(rep("T. rex", 50),
                                   rep("T. bataar", 30)),
                     "Site" = c(rep("Big Red", 24),
                                rep("Supermax", 26),
                                rep("Big Red", 16),
                                rep("Supermax", 10),
                                rep("Oz", 4)),
                     "Site.ID" = c(rep("1578", 24), rep("0185", 26),
                                   rep("1578", 16), rep("0185", 10),
                                   rep("2115", 4)),
                     "Year" = c(1999, 1999, 1999, 2000, 2001, 2002, 2002,
                                2003, 2003, 2003, 2003, 2003, 2003, 2004,
                                2004, 2004, 2004, 2004, 2005, 2005, 2005,
                                2006, 2006, 2007, 1978, 1978, 1978, 1978,
                                1979, 1979, 1999, 1999, 2000, 2000, 2000,
                                2000, 2000, 2001, 2001, 2001, 2002, 2003,
                                2003, 2003, 2003, 2004, 2005, 2006, 2006,
                                2006, 2003, 2003, 2003, 2003, 2003, 2004,
                                2004, 2004, 2004, 2004, 2005, 2005, 2005,
                                2006, 2006, 2007, 1978, 1978, 1978, 1978,
                                1979, 1979, 1999, 1999, 2000, 2000, 2012,
                                2012, 2012, 2013),
                     "Count" = c(0, 1, 5, 0, 3, 1, 1, 0, 1, 3, 2, 1, 1, 0, 0,
                                 1, 2, 3, 1, 1, 5, 0, 1, 2, 8, 11, 7, 2, 3,
                                 1, 1, 0, 2, 5, 6, 0, 1, 2, 1, 1, 0, 0, 2, 3,
                                 1, 2, 0, 1, 2, 1, 1, 3, 2, 1, 1, 0, 0, 1, 2,
                                 3, 1, 1, 5, 0, 1, 2, 8, 11, 7, 2, 3, 1, 1,
                                 0, 2, 5, 1, 1, 3, 0),
                     stringsAsFactors = FALSE)

Below is the summary data frame with the species layer.

new.frame.2 <- data.frame(
                 "Line.ID" = c(1, 2, 3, 4, 5),
                 "Species" = c(rep("T. rex", 2), rep("T. bataar", 3)),
                 "Site" = c("Big Red", "Supermax", "Big Red", "Supermax", "Oz"),
                 "Total.years" = c(9, 10, 5, 4, 2),
                 "Years.3" = c(4, 5, 3, 1, 1),
                 "Years.4" = c(2, 3, 2, 1, 0),
                 "Years.5" = c(2, 1, 2, 0, 0),
                 "Total.pos" = c(8, 8, 5, 4, 1),
                 "Pos.3" = c(3, 5, 3, 1, 1),
                 "Pos.4" = c(1, 2, 1, 1, 0),
                 "Pos.5" = c(1, 0, 1, 0, 0),
                 stringsAsFactors = FALSE)

2 Answers:

Answer 0 (score: 1)

We can use table to count how often each Year occurs and compute all of the summaries inside a single group_by.

library(dplyr)

example.frame %>%
   group_by(Site) %>%
   summarise(Total_Years = n_distinct(Year), 
             Years.3 = sum(table(Year) >= 3), 
             Years.4 = sum(table(Year) >= 4), 
             Years.5 = sum(table(Year) >= 5), 
             Total.Pos = sum(table(Year[Count > 0]) > 0),
             Pos.3 = sum(table(Year[Count > 0]) >= 3),
             Pos.4 = sum(table(Year[Count > 0]) >= 4),
             Pos.5 = sum(table(Year[Count > 0]) >= 5)) %>%
   ungroup() %>%
   mutate(Line.ID = row_number()) %>%
  select(Line.ID, everything())

#  Line.ID Site     Total_Years Years.3 Years.4 Years.5 Total.Pos Pos.3 Pos.4 Pos.5
#    <int> <chr>          <int>   <int>   <int>   <int>     <int> <int> <int> <int>
#1       1 Big Red            9       4       2       2         8     3     1     1
#2       2 Supermax          10       5       3       1         8     5     2     0
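
As a possible variation (not part of the answer above), the repeated table() calls can be avoided by first reducing the data to one row per site and year, then summarising those rows. The sketch below uses a small made-up data frame named toy, and it assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Small hypothetical data set (not the question's data)
toy <- data.frame(
  Site  = c(rep("A", 5), rep("B", 4)),
  Year  = c(2000, 2000, 2000, 2001, 2001, 2000, 2000, 2000, 2000),
  Count = c(0, 1, 2, 0, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Step 1: one row per Site/Year with the number of observations
# and the number of observations above zero
per_year <- toy %>%
  group_by(Site, Year) %>%
  summarise(n_obs = n(), n_pos = sum(Count > 0), .groups = "drop")

# Step 2: count the years that meet each threshold
result <- per_year %>%
  group_by(Site) %>%
  summarise(Total.years = n(),
            Years.3     = sum(n_obs >= 3),
            Total.pos   = sum(n_pos > 0),
            Pos.3       = sum(n_pos >= 3),
            .groups = "drop")
```

The intermediate per_year table can also be handy for checking individual years by eye before trusting the totals.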

For the second question, we only need to add Species as an extra group_by variable, and the same code works.

example.frame.2 %>%
   group_by(Species, Site) %>%
   summarise(Total_Years = n_distinct(Year), 
             Years.3 = sum(table(Year) >= 3), 
             Years.4 = sum(table(Year) >= 4), 
             Years.5 = sum(table(Year) >= 5), 
             Total.Pos = sum(table(Year[Count > 0]) > 0),
             Pos.3 = sum(table(Year[Count > 0]) >= 3),
             Pos.4 = sum(table(Year[Count > 0]) >= 4),
             Pos.5 = sum(table(Year[Count > 0]) >= 5)) %>%
    ungroup() %>%
    mutate(Line.ID = row_number()) %>%
    select(Line.ID, everything())


#  Line.ID Species   Site     Total_Years Years.3 Years.4 Years.5 Total.Pos Pos.3 Pos.4 Pos.5
#    <int> <chr>     <chr>          <int>   <int>   <int>   <int>     <int> <int> <int> <int>
#1       1 T. bataar Big Red            5       3       2       2         5     3     1     1
#2       2 T. bataar Oz                 2       1       0       0         1     1     0     0
#3       3 T. bataar Supermax           4       1       1       0         4     1     1     0
#4       4 T. rex    Big Red            9       4       2       2         8     3     1     1
#5       5 T. rex    Supermax          10       5       3       1         8     5     2     0
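
Note that group_by sorts character groups alphabetically, so T. bataar comes before T. rex here, whereas the desired new.frame.2 lists T. rex first. If the order of first appearance matters, one option is to convert the grouping columns to factors whose levels follow that order. This is a sketch on a made-up data frame named toy, assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Small hypothetical data set (not the question's data)
toy <- data.frame(
  Species = c("T. rex", "T. rex", "T. bataar", "T. bataar"),
  Site    = c("Big Red", "Supermax", "Big Red", "Oz"),
  Year    = c(1999, 2000, 2003, 2012),
  stringsAsFactors = FALSE
)

ordered_summary <- toy %>%
  # factor levels in order of first appearance control the group order
  mutate(Species = factor(Species, levels = unique(Species)),
         Site    = factor(Site, levels = unique(Site))) %>%
  group_by(Species, Site) %>%
  summarise(Total.years = n_distinct(Year), .groups = "drop") %>%
  mutate(Line.ID = row_number()) %>%
  select(Line.ID, everything())
```

With this change the T. rex rows come first and Line.ID follows that order.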

Answer 1 (score: 0)

We can do this programmatically (assuming we need to make several comparisons).

library(data.table)
# set an identifier for values to compare
n1 <- 3:5
# convert the data.frame to data.table, get the Total_Years, Total_pos
# grouped by Site
dt1 <- setDT(example.frame)[, .(Total_Years = uniqueN(Year),
           Total_pos = sum(tabulate(Year[Count > 0]) > 0)), Site]

# grouped by Site, loop through the Year, 
# Year where 'Count' is greater than 0 with lapply
# get the frequency count with tabulate
# check whether it is greater than or equal to values in n1
# get the sum of logical vector inside Map
# melt into long format 
# dcast the data into wide after doing some transformation
# join with the dt1 on Site
dcast(melt(setnames(example.frame[, lapply(list(Year, Year[Count > 0]),
    function(u) Map(function(x, y) sum(x >= y), list(tabulate(u)), n1)), 
      by = Site], 2:3, c("Year", "Pos")), id.var = "Site")[, 
     variable := paste0(variable, ".", n1)], Site ~ variable)[dt1, on = .(Site)]
#       Site Pos.3 Pos.4 Pos.5 Year.3 Year.4 Year.5 Total_Years Total_pos
#1:  Big Red     3     1     1      4      2      2           9         8
#2: Supermax     5     2     0      5      3      1          10         8
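
The code above uses tabulate rather than table. The short sketch below, on a made-up vector years, illustrates why the two give the same threshold counts for calendar years: tabulate returns one bin for every integer from 1 up to the largest value, and the extra bins are all zero, so they never satisfy > 0 or >= 3.

```r
years <- c(1999, 1999, 2001)   # hypothetical years

tab  <- table(years)      # one entry per distinct year
tabu <- tabulate(years)   # one bin per integer from 1 to max(years)

n_tab_bins  <- length(tab)    # 2
n_tabu_bins <- length(tabu)   # 2001

# Empty bins are zero, so threshold counts agree
years_2_table    <- sum(tab >= 2)    # 1
years_2_tabulate <- sum(tabu >= 2)   # 1
distinct_years   <- sum(tabu > 0)    # 2

# tabulate silently ignores non-positive values, so it only suits
# positive integer keys such as calendar years
dropped <- tabulate(c(-1, 0, 2))     # 0 1
```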

For the second dataset, add Species to the grouping variables as well and proceed as before.

dt1 <- setDT(example.frame.2)[, .(Total_Years = uniqueN(Year),
       Total_pos = sum(tabulate(Year[Count > 0]) > 0)), .(Species, Site)]

dcast(melt(setnames(example.frame.2[, lapply(list(Year, Year[Count > 0]), 
 function(u) Map(function(x, y) sum(x >= y), list(tabulate(u)), n1)),
   by = .(Species, Site)], 3:4, c("Year", "Pos")),
   id.var = c("Species", "Site"))[, variable := paste0(variable, ".", n1)], 
     Species + Site ~ variable)[dt1, on = .(Species, Site)]
#     Species     Site Pos.3 Pos.4 Pos.5 Year.3 Year.4 Year.5 Total_Years Total_pos
#1:    T. rex  Big Red     3     1     1      4      2      2           9         8
#2:    T. rex Supermax     5     2     0      5      3      1          10         8
#3: T. bataar  Big Red     3     1     1      3      2      2           5         5
#4: T. bataar Supermax     1     1     0      1      1      0           4         4
#5: T. bataar       Oz     1     0     0      1      0      0           2         1

This can also be done with the tidyverse. It is better to wrap the logic in a reusable function.

library(tidyverse)
countFn <- function(data, grpVars, yearCol, countCol, n) {
  # capture the unquoted column names
  yearCol <- enquo(yearCol)
  countCol <- enquo(countCol)

  # output column names for each threshold, e.g. Years.3 and Pos.3
  yearnm <- paste0("Years.", n)
  posnm <- paste0("Pos.", n)

  # totals per group
  d1 <- data %>%
    group_by_at(grpVars) %>%
    summarise(Total_Years = n_distinct(!! yearCol),
              Total_pos = sum(tabulate((!! yearCol)[(!! countCol) > 0]) > 0))

  # threshold counts per group, joined back to the totals
  data %>%
    group_by_at(grpVars) %>%
    summarise(Col = list(map(list((!! yearCol), (!! yearCol)[(!! countCol) > 0]),
                             ~ map2_dfc(list(tabulate(.x)), n,
                                        ~ sum(.x >= .y))) %>%
                           map2_dfc(., list(yearnm, posnm), set_names))) %>%
    right_join(d1) %>%
    unnest(Col)
}

Testing:

n1 <- 3:5
countFn(example.frame, "Site", Year, Count, n1)
# A tibble: 2 x 9
#  Site     Total_Years Total_pos Years.3 Years.4 Years.5 Pos.3 Pos.4 Pos.5
#  <chr>          <int>     <int>   <int>   <int>   <int> <int> <int> <int>
#1 Big Red            9         8       4       2       2     3     1     1
#2 Supermax          10         8       5       3       1     5     2     0

countFn(example.frame.2, c("Species", "Site"), Year, Count, n1)
# A tibble: 5 x 10
# Groups:   Species [2]
#  Species   Site     Total_Years Total_pos Years.3 Years.4 Years.5 Pos.3 Pos.4 Pos.5
#  <chr>     <chr>          <int>     <int>   <int>   <int>   <int> <int> <int> <int>
#1 T. bataar Big Red            5         5       3       2       2     3     1     1
#2 T. bataar Oz                 2         1       1       0       0     1     0     0
#3 T. bataar Supermax           4         4       1       1       0     1     1     0
#4 T. rex    Big Red            9         8       4       2       2     3     1     1
#5 T. rex    Supermax          10         8       5       3       1     5     2     0