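I'm trying to teach myself how to work with and summarize large datasets. I want to create a new data frame populated with summary data about the years associated with the count observations at each site. I've created an example data frame below.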
example.frame <- data.frame(
"Obs.ID" = 1:50,
"Species" = rep("T. rex", 50),
"Site" = c(rep("Big Red", 24), rep("Supermax", 26)),
"Site.ID" = c(rep("1578", 24), rep("0185", 26)),
"Year" = c(1999, 1999, 1999, 2000, 2001, 2002, 2002,
2003, 2003, 2003, 2003, 2003, 2003, 2004,
2004, 2004, 2004, 2004, 2005, 2005, 2005,
2006, 2006, 2007, 1978, 1978, 1978, 1978,
1979, 1979, 1999, 1999, 2000, 2000, 2000,
2000, 2000, 2001, 2001, 2001, 2002, 2003,
2003, 2003, 2003, 2004, 2005, 2006, 2006,
2006),
"Count" = c(0, 1, 5, 0, 3, 1, 1, 0, 1, 3, 2, 1, 1, 0,
0, 1, 2, 3, 1, 1, 5, 0, 1, 2, 8, 11, 7,
2, 3, 1, 1, 0, 2, 5, 6, 0, 1, 2, 1, 1, 0,
0, 2, 3, 1, 2, 0, 1, 2, 1),
stringsAsFactors = FALSE)
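I want to create a new data frame that summarizes this example data and includes the columns defined below.
Line.ID – A unique sequential identifier for the rows of the data frame, starting at 1
Site – The name of the site from example.frame
Total.years – The total number of unique years associated with the site
Years.3 – The number of years with at least three counts (including zeros) associated with the site
Years.4 – The number of years with at least four counts (including zeros) associated with the site
Years.5 – The number of years with at least five counts (including zeros) associated with the site
Total.pos – The total number of years with at least one count greater than zero for the site
Pos.3 – The number of years with at least three counts greater than zero for the site
Pos.4 – The number of years with at least four counts greater than zero for the site
Pos.5 – The number of years with at least five counts greater than zero for the site
The new data frame should look like this: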
new.frame <- data.frame(
"Line.ID" = c(1, 2),
"Site" = c("Big Red", "Supermax"),
"Total.years" = c(9, 10),
"Years.3" = c(4, 5),
"Years.4" = c(2, 3),
"Years.5" = c(2, 1),
"Total.pos" = c(8, 8),
"Pos.3" = c(3, 5),
"Pos.4" = c(1, 2),
"Pos.5" = c(1, 0),
stringsAsFactors = FALSE)
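I think the right approach is some combination of summarise and group_by from dplyr, but I can't work out how to put them together. I couldn't find an existing answer that fits this situation, so I thought it would be useful to post.
Follow-up question:
How do I incorporate an additional layer when creating the summary table (for example, a second species that occurs at the same sites)? An example data frame is below.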
example.frame.2 <- data.frame(
"Obs.ID" = 1:80,
"Species" = c(rep("T. rex", 50),
rep("T. bataar", 30)),
"Site" = c(rep("Big Red", 24),
rep("Supermax", 26),
rep("Big Red", 16),
rep("Supermax", 10),
rep("Oz", 4)),
"Site.ID" = c(rep("1578", 24), rep("0185", 26),
rep("1578", 16), rep("0185", 10),
rep("2115", 4)),
"Year" = c(1999, 1999, 1999, 2000, 2001, 2002, 2002,
2003, 2003, 2003, 2003, 2003, 2003, 2004,
2004, 2004, 2004, 2004, 2005, 2005, 2005,
2006, 2006, 2007, 1978, 1978, 1978, 1978,
1979, 1979, 1999, 1999, 2000, 2000, 2000,
2000, 2000, 2001, 2001, 2001, 2002, 2003,
2003, 2003, 2003, 2004, 2005, 2006, 2006,
2006, 2003, 2003, 2003, 2003, 2003, 2004,
2004, 2004, 2004, 2004, 2005, 2005, 2005,
2006, 2006, 2007, 1978, 1978, 1978, 1978,
1979, 1979, 1999, 1999, 2000, 2000, 2012,
2012, 2012, 2013),
"Count" = c(0, 1, 5, 0, 3, 1, 1, 0, 1, 3, 2, 1, 1, 0, 0,
1, 2, 3, 1, 1, 5, 0, 1, 2, 8, 11, 7, 2, 3,
1, 1, 0, 2, 5, 6, 0, 1, 2, 1, 1, 0, 0, 2, 3,
1, 2, 0, 1, 2, 1, 1, 3, 2, 1, 1, 0, 0, 1, 2,
3, 1, 1, 5, 0, 1, 2, 8, 11, 7, 2, 3, 1, 1,
0, 2, 5, 1, 1, 3, 0),
stringsAsFactors = FALSE)
Below is the summary data frame with the species layer.
new.frame.2 <- data.frame(
"Line.ID" = c(1, 2, 3, 4, 5),
"Species" = c(rep("T. rex", 2), rep("T. bataar", 3)),
"Site" = c("Big Red", "Supermax", "Big Red", "Supermax", "Oz"),
"Total.years" = c(9, 10, 5, 4, 2),
"Years.3" = c(4, 5, 3, 1, 1),
"Years.4" = c(2, 3, 2, 1, 0),
"Years.5" = c(2, 1, 2, 0, 0),
"Total.pos" = c(8, 8, 5, 4, 1),
"Pos.3" = c(3, 5, 3, 1, 1),
"Pos.4" = c(1, 2, 1, 1, 0),
"Pos.5" = c(1, 0, 1, 0, 0),
stringsAsFactors = FALSE)
Answer 0 (score: 1)
We can use table() to count the frequencies per year and do all of the counting within a single group_by().
library(dplyr)
example.frame %>%
  group_by(Site) %>%
  summarise(Total_Years = n_distinct(Year),
            Years.3 = sum(table(Year) >= 3),
            Years.4 = sum(table(Year) >= 4),
            Years.5 = sum(table(Year) >= 5),
            Total.Pos = sum(table(Year[Count > 0]) > 0),
            Pos.3 = sum(table(Year[Count > 0]) >= 3),
            Pos.4 = sum(table(Year[Count > 0]) >= 4),
            Pos.5 = sum(table(Year[Count > 0]) >= 5)) %>%
  ungroup() %>%
  mutate(Line.ID = row_number()) %>%
  select(Line.ID, everything())
# Line.ID Site Total_Years Years.3 Years.4 Years.5 Total.Pos Pos.3 Pos.4 Pos.5
# <int> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 Big Red 9 4 2 2 8 3 1 1
#2 2 Supermax 10 5 3 1 8 5 2 0
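To see why sum(table(...) >= k) gives the number of qualifying years, here is a minimal sketch on toy vectors. The names year, count, obs_per_year and pos_per_year are made up for illustration and are not part of the question's data.

```r
# Hypothetical toy vectors (not taken from the question's data)
year  <- c(1999, 1999, 1999, 2000, 2001, 2001)
count <- c(0, 1, 5, 0, 3, 0)

# table() returns one frequency per distinct year:
# 1999 appears 3 times, 2000 once, 2001 twice
obs_per_year <- table(year)

# Comparing yields one logical per year; sum() counts the TRUEs,
# i.e. the number of years with at least three observations
years_3 <- sum(obs_per_year >= 3)

# Subsetting before tabulating keeps only observations with a positive
# count, so years without any positive count drop out of the table
pos_per_year <- table(year[count > 0])
total_pos <- sum(pos_per_year > 0)
```

Here years_3 is 1 (only 1999 has three observations) and total_pos is 2 (1999 and 2001 each have at least one positive count).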
For the follow-up question, we only need to add Species as an extra group_by() variable and the same code works.
example.frame.2 %>%
  group_by(Species, Site) %>%
  summarise(Total_Years = n_distinct(Year),
            Years.3 = sum(table(Year) >= 3),
            Years.4 = sum(table(Year) >= 4),
            Years.5 = sum(table(Year) >= 5),
            Total.Pos = sum(table(Year[Count > 0]) > 0),
            Pos.3 = sum(table(Year[Count > 0]) >= 3),
            Pos.4 = sum(table(Year[Count > 0]) >= 4),
            Pos.5 = sum(table(Year[Count > 0]) >= 5)) %>%
  ungroup() %>%
  mutate(Line.ID = row_number()) %>%
  select(Line.ID, everything())
# Line.ID Species Site Total_Years Years.3 Years.4 Years.5 Total.Pos Pos.3 Pos.4 Pos.5
# <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 T. bataar Big Red 5 3 2 2 5 3 1 1
#2 2 T. bataar Oz 2 1 0 0 1 1 0 0
#3 3 T. bataar Supermax 4 1 1 0 4 1 1 0
#4 4 T. rex Big Red 9 4 2 2 8 3 1 1
#5 5 T. rex Supermax 10 5 3 1 8 5 2 0
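As a possible alternative that avoids calling table() repeatedly, the same idea can be sketched in two steps: first reduce the data to one row per site and year, then count the years that reach each threshold. This is only a sketch on a hypothetical toy data frame (toy, per_year and toy_summary are invented names), and it assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical toy data using the same column names as the question
toy <- data.frame(
  Site  = c(rep("A", 5), rep("B", 3)),
  Year  = c(2000, 2000, 2000, 2001, 2001, 2000, 2000, 2002),
  Count = c(0, 1, 2, 0, 0, 3, 1, 0),
  stringsAsFactors = FALSE)

# Step 1: one row per Site and Year, holding the number of observations
# and the number of observations with a positive count
per_year <- toy %>%
  group_by(Site, Year) %>%
  summarise(n_obs = n(),
            n_pos = sum(Count > 0),
            .groups = "drop")

# Step 2: per Site, count how many years reach each threshold
toy_summary <- per_year %>%
  group_by(Site) %>%
  summarise(Total.years = n(),
            Years.3 = sum(n_obs >= 3),
            Total.pos = sum(n_pos > 0),
            Pos.3 = sum(n_pos >= 3),
            .groups = "drop")
```

To handle the species layer, Species would be added to both group_by() calls in the same way as above.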
Answer 1 (score: 0)
We can do this programmatically (assuming we need to make multiple comparisons).
library(data.table)
# set an identifier for values to compare
n1 <- 3:5
# convert the data.frame to data.table, get the Total_Years, Total_pos
# grouped by Site
dt1 <- setDT(example.frame)[, .(Total_Years = uniqueN(Year),
Total_pos = sum(tabulate(Year[Count > 0]) > 0)), Site]
# grouped by Site, loop through the Year,
# Year where 'Count' is greater than 0 with lapply
# get the frequency count with tabulate
# check whether it is greater than or equal to values in n1
# get the sum of logical vector inside Map
# melt into long format
# dcast the data into wide after doing some transformation
# join with the dt1 on Site
dcast(melt(setnames(example.frame[, lapply(list(Year, Year[Count > 0]),
function(u) Map(function(x, y) sum(x >= y), list(tabulate(u)), n1)),
by = Site], 2:3, c("Year", "Pos")), id.var = "Site")[,
variable := paste0(variable, ".", n1)], Site ~ variable)[dt1, on = .(Site)]
# Site Pos.3 Pos.4 Pos.5 Year.3 Year.4 Year.5 Total_Years Total_pos
#1: Big Red 3 1 1 4 2 2 9 8
#2: Supermax 5 2 0 5 3 1 10 8
For the second dataset, add Species to the grouping variables as well and proceed as before:
dt1 <- setDT(example.frame.2)[, .(Total_Years = uniqueN(Year),
Total_pos = sum(tabulate(Year[Count > 0]) > 0)), .(Species, Site)]
dcast(melt(setnames(example.frame.2[, lapply(list(Year, Year[Count > 0]),
function(u) Map(function(x, y) sum(x >= y), list(tabulate(u)), n1)),
by = .(Species, Site)], 3:4, c("Year", "Pos")),
id.var = c("Species", "Site"))[, variable := paste0(variable, ".", n1)],
Species + Site ~ variable)[dt1, on = .(Species, Site)]
# Species Site Pos.3 Pos.4 Pos.5 Year.3 Year.4 Year.5 Total_Years Total_pos
#1: T. rex Big Red 3 1 1 4 2 2 9 8
#2: T. rex Supermax 5 2 0 5 3 1 10 8
#3: T. bataar Big Red 3 1 1 3 2 2 5 5
#4: T. bataar Supermax 1 1 0 1 1 0 4 4
#5: T. bataar Oz 1 0 0 1 0 0 2 1
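The code above uses tabulate() where the first answer used table(). A small sketch of the difference, on a hypothetical vector of small integers so the bins are easy to read:

```r
# Hypothetical values, chosen small so every bin can be shown
x <- c(3, 3, 5)

# tabulate() counts into integer bins 1..max(x), including empty bins
bins <- tabulate(x)

# table() only reports the values that actually occur
freq <- table(x)

# Empty bins are 0, so they never satisfy ">= k" for k >= 1 and both
# approaches give the same number of qualifying values
n_tab   <- sum(bins >= 2)
n_table <- sum(freq >= 2)

# With real years the bin vector is simply long and mostly zeros
n_bins <- length(tabulate(c(1999, 2003)))
```

Here bins is 0 0 2 0 1, and n_bins is 2003 because the bins run from 1 up to the largest year. That is harmless for data of this size, though tabulate() only works for positive whole-number values.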
This can also be done with the tidyverse. It is better to wrap the logic in a reusable function:
library(tidyverse)
countFn <- function(data, grpVars, yearCol, countCol, n) {
  yearCol <- enquo(yearCol)
  countCol <- enquo(countCol)
  yearnm <- paste0("Years.", n)
  posnm <- paste0("Pos.", n)
  d1 <- data %>%
    group_by_at(grpVars) %>%
    summarise(Total_Years = n_distinct(!! yearCol),
              Total_pos = sum(tabulate((!! yearCol)[(!! countCol) > 0]) > 0))
  data %>%
    group_by_at(grpVars) %>%
    summarise(Col = list(map(list((!! yearCol), (!! yearCol)[(!! countCol) > 0]),
                             ~ map2_dfc(list(tabulate(.x)), n,
                                        ~ sum(.x >= .y))
                             ) %>% map2_dfc(., list(yearnm, posnm), set_names)
                         )) %>%
    right_join(d1) %>%
    unnest(Col)
}
Testing:
n1 <- 3:5
countFn(example.frame, "Site", Year, Count, n1)
# A tibble: 2 x 9
# Site Total_Years Total_pos Years.3 Years.4 Years.5 Pos.3 Pos.4 Pos.5
# <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#1 Big Red 9 8 4 2 2 3 1 1
#2 Supermax 10 8 5 3 1 5 2 0
countFn(example.frame.2, c("Species", "Site"), Year, Count, n1)
# A tibble: 5 x 10
# Groups: Species [2]
# Species Site Total_Years Total_pos Years.3 Years.4 Years.5 Pos.3 Pos.4 Pos.5
# <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#1 T. bataar Big Red 5 5 3 2 2 3 1 1
#2 T. bataar Oz 2 1 1 0 0 1 0 0
#3 T. bataar Supermax 4 4 1 1 0 1 1 0
#4 T. rex Big Red 9 8 4 2 2 3 1 1
#5 T. rex Supermax 10 8 5 3 1 5 2 0
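The core of the programmatic idea (one count per threshold in n1) can also be sketched in base R without any packages. The helper count_years() below is hypothetical and not part of the function above. It handles a single vector of years, so it would still need to be applied per group.

```r
# Hypothetical helper: for each threshold k in n, count the distinct
# years that occur at least k times in `year`
count_years <- function(year, n, prefix = "Years.") {
  freq <- table(year)
  out <- vapply(n, function(k) sum(freq >= k), integer(1))
  setNames(out, paste0(prefix, n))
}

# Toy input (not from the question): 1999 x 3, 2000 x 4, 2001 x 1
toy_year <- c(1999, 1999, 1999, 2000, 2000, 2000, 2000, 2001)
res <- count_years(toy_year, 3:5)
```

For this toy input, res holds Years.3 = 2, Years.4 = 1 and Years.5 = 0. Passing year[count > 0] with prefix = "Pos." would give the corresponding positive-count columns.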