I have a dataset with the following columns:
locID = the location ID of the observer
yr = the year of the observation in categorical format: P_year
maxFlock = a number counted by the observer
lat = latitude of the location
long = longitude of the location
state = US state of the observation
effortDays = categorical, I, II, III, and IV
effortHours = categorical, A, B, C, D
Here is a sample of the data frame:
PData
locID yr maxFlock lat long state effortDays effortHours
L4278 P_2000 3 41.42 -73.67 NY II C
L4278 P_2000 6 41.42 -73.67 NY III C
L4278 P_2000 4 41.42 -73.67 NY III C
L4278 P_2012 2 41.42 -73.67 NY III B
L4278 P_2012 4 41.42 -73.67 NY IV B
L4278 P_2012 8 41.42 -73.67 NY IV B
L10494 P_2003 4 42.01 -77.44 NY IV C
L10494 P_2003 0 42.01 -77.44 NY IV C
L10494 P_2003 8 42.01 -77.44 NY IV D
L10494 P_2005 4 42.01 -77.44 NY IV C
L10494 P_2005 6 42.01 -77.44 NY IV C
L10494 P_2009 8 42.01 -77.44 NY IV C
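I want to create a new column (labeled xmf) that holds the mean of maxFlock. However, the mean has to be calculated for each unique combination of locID, yr, effortDays, and effortHours. If I ran the code on the sample above, the final product would look like this.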
PData
locID yr maxFlock xmf lat long state effortDays effortHours
L4278 P_2000 3 3 41.42 -73.67 NY II C
L4278 P_2000 6 5 41.42 -73.67 NY III C
L4278 P_2000 4 5 41.42 -73.67 NY III C
L4278 P_2012 2 2 41.42 -73.67 NY III B
L4278 P_2012 4 6 41.42 -73.67 NY IV B
L4278 P_2012 8 6 41.42 -73.67 NY IV B
L10494 P_2003 4 2 42.01 -77.44 NY IV C
L10494 P_2003 0 2 42.01 -77.44 NY IV C
L10494 P_2003 8 8 42.01 -77.44 NY IV D
L10494 P_2005 4 5 42.01 -77.44 NY IV C
L10494 P_2005 6 5 42.01 -77.44 NY IV C
L10494 P_2009 8 8 42.01 -77.44 NY IV C
I initially tried to do this with:
PData$xmf = ave(myData2$maxFlock, myData2$locID, myData2$yr, myData2$effortDays, myData2$effortHours)
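But it didn't work (I had to kill it after waiting more than half an hour), and I'm not even sure ave() can do what I want.
I was thinking of trying something with the split-apply-combine approach, but I don't think that is quite what I'm looking for, because I would have to subset by locID, then by year, then by either effort hours or effort days, and I don't want to make that choice. I want to do it by unique combination.
A fast way of doing this would also be great. The data I'm working with has about 2.5 million rows, so if statements inside loops are definitely not ideal.
Thanks!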
Answer 0 (score: 1)
A solution using dplyr.
library(dplyr)
PData <- PData %>%
group_by(locID, yr, effortDays, effortHours) %>%
mutate(xmf = mean(maxFlock)) %>%
select(c(1:3, 9, 4:8))
PData
# A tibble: 12 x 9
# Groups: locID, yr, effortDays, effortHours [8]
locID yr maxFlock xmf lat long state effortDays effortHours
<chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 L4278 P_2000 3 3 41.42 -73.67 NY II C
2 L4278 P_2000 6 5 41.42 -73.67 NY III C
3 L4278 P_2000 4 5 41.42 -73.67 NY III C
4 L4278 P_2012 2 2 41.42 -73.67 NY III B
5 L4278 P_2012 4 6 41.42 -73.67 NY IV B
6 L4278 P_2012 8 6 41.42 -73.67 NY IV B
7 L10494 P_2003 4 2 42.01 -77.44 NY IV C
8 L10494 P_2003 0 2 42.01 -77.44 NY IV C
9 L10494 P_2003 8 8 42.01 -77.44 NY IV D
10 L10494 P_2005 4 5 42.01 -77.44 NY IV C
11 L10494 P_2005 6 5 42.01 -77.44 NY IV C
12 L10494 P_2009 8 8 42.01 -77.44 NY IV C
Data
PData <- read.table(text = " locID yr maxFlock lat long state effortDays effortHours
L4278 P_2000 3 41.42 -73.67 NY II C
L4278 P_2000 6 41.42 -73.67 NY III C
L4278 P_2000 4 41.42 -73.67 NY III C
L4278 P_2012 2 41.42 -73.67 NY III B
L4278 P_2012 4 41.42 -73.67 NY IV B
L4278 P_2012 8 41.42 -73.67 NY IV B
L10494 P_2003 4 42.01 -77.44 NY IV C
L10494 P_2003 0 42.01 -77.44 NY IV C
L10494 P_2003 8 42.01 -77.44 NY IV D
L10494 P_2005 4 42.01 -77.44 NY IV C
L10494 P_2005 6 42.01 -77.44 NY IV C
L10494 P_2009 8 42.01 -77.44 NY IV C
",
header = TRUE, stringsAsFactors = FALSE)
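A possible variation on the pipeline above, offered only as a sketch: group_by() followed by mutate() returns a tibble that is still grouped, and select() by position depends on the exact column order. Assuming a dplyr version that provides relocate() (1.0.0 or later), the grouping can be dropped with ungroup() and the new column placed by name. The small data frame here is a subset of the sample rows, rebuilt so the block runs on its own, and `result` is just an illustrative name.

```r
library(dplyr)

# Illustrative subset of the sample rows (not the full data set)
PData <- data.frame(
  locID       = c("L4278", "L4278", "L4278", "L10494", "L10494"),
  yr          = c("P_2000", "P_2000", "P_2000", "P_2003", "P_2003"),
  maxFlock    = c(3, 6, 4, 4, 0),
  effortDays  = c("II", "III", "III", "IV", "IV"),
  effortHours = c("C", "C", "C", "C", "C"),
  stringsAsFactors = FALSE
)

result <- PData %>%
  group_by(locID, yr, effortDays, effortHours) %>%
  mutate(xmf = mean(maxFlock)) %>%      # group mean repeated on every row of the group
  ungroup() %>%                         # drop the grouping so later steps are not grouped
  relocate(xmf, .after = maxFlock)      # position xmf by name instead of by index

result
```

Referring to columns by name should keep the pipeline working if columns are later added or reordered.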
Answer 1 (score: 0)
You can create a new column that combines the four columns (locID, yr, effortDays, effortHours). Then call tapply with the new column as INDEX, and simply extract the values.
grouping <- paste(PData$locID,
PData$yr,
PData$effortDays,
PData$effortHours, sep = "_")
agg.vals <- tapply(PData$maxFlock, INDEX = grouping, FUN = mean)
PData["xmf"] <- agg.vals[grouping]
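The same combined key can also be handed to ave(), which speaks to the doubt in the question about whether ave() can do this at all. One likely reason the original call was so slow is that ave() builds the interaction of all grouping vectors it is given, including combinations that never occur in the data; a single pasted key avoids that. This is a sketch on a subset of the sample rows, and `key` is just an illustrative name.

```r
# Illustrative subset of the sample rows (not the full data set)
PData <- data.frame(
  locID       = c("L4278", "L4278", "L4278", "L10494", "L10494"),
  yr          = c("P_2000", "P_2000", "P_2000", "P_2003", "P_2003"),
  maxFlock    = c(3, 6, 4, 4, 0),
  effortDays  = c("II", "III", "III", "IV", "IV"),
  effortHours = c("C", "C", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# One key per row; only combinations that actually occur become groups
key <- paste(PData$locID, PData$yr, PData$effortDays, PData$effortHours, sep = "_")

# ave() returns a vector as long as its input, so it can be assigned directly
PData$xmf <- ave(PData$maxFlock, key, FUN = mean)

PData
```

Because ave() keeps the original row order, no lookup step is needed afterwards. Whether this is fast enough on 2.5 million rows would need to be timed on the real data.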
Answer 2 (score: 0)
df <- aggregate(PData$maxFlock, by = list(PData$locID, PData$yr, PData$effortDays, PData$effortHours), FUN = mean)
names(df) <- c("locID", "yr", "effortDays", "effortHours", "xmf")
df
locID yr effortDays effortHours xmf
1 L4278 P_2012 III B 2
2 L4278 P_2012 IV B 6
3 L4278 P_2000 II C 3
4 L4278 P_2000 III C 5
5 L10494 P_2003 IV C 2
6 L10494 P_2005 IV C 5
7 L10494 P_2009 IV C 8
8 L10494 P_2003 IV D 8
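Note that the aggregate() result has one row per group rather than one value per original row. To get the xmf column the question asks for, the group means can be joined back onto the data, for example with merge(). This is a minimal sketch on a subset of the sample rows; it uses the formula interface of aggregate(), and `means` is just an illustrative name.

```r
# Illustrative subset of the sample rows (not the full data set)
PData <- data.frame(
  locID       = c("L4278", "L4278", "L4278", "L10494", "L10494"),
  yr          = c("P_2000", "P_2000", "P_2000", "P_2003", "P_2003"),
  maxFlock    = c(3, 6, 4, 4, 0),
  effortDays  = c("II", "III", "III", "IV", "IV"),
  effortHours = c("C", "C", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# One row per observed combination of the four grouping columns
means <- aggregate(maxFlock ~ locID + yr + effortDays + effortHours,
                   data = PData, FUN = mean)
names(means)[names(means) == "maxFlock"] <- "xmf"

# Join the group means back; merge() puts the key columns first and
# may reorder the rows
PData <- merge(PData, means,
               by = c("locID", "yr", "effortDays", "effortHours"))

PData
```

If the original row order matters, it would have to be restored afterwards, for instance by sorting on a row index added before the merge.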