The following problem has been giving me a headache.
I have a large dataset that looks like this.
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 ....
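As you can see, there are a lot of NAs in my observations. The other columns look similar, and the dataset has more than 100,000 rows, so it is huge.
What I want to do is aggregate my data as follows. Taking C1 as an example: for every Name, I want to compute a monthly average for each month of each year in the period from 2000-01 to 2012-12.
The monthly average should be calculated from the dated observations that fall within each month.
Once the calculation is done, my dataset should look like this.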
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01 monthly average
A 2008-02 monthly average
A 2008-03 monthly average
A 2008-04 monthly average
A 2008-05 monthly average
A 2008-06 monthly average
A 2008-07 monthly average
A 2008-08 monthly average
A 2008-09 monthly average
A 2008-10 monthly average
A 2008-11 monthly average
A 2008-12 monthly average
A 2009-01 monthly average
B 2008-01 monthly average
B 2008-02 monthly average
B 2008-03 monthly average
B 2008-04 monthly average
B 2008-05 monthly average
B 2008-06 ....
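So my output should list every Name for every month of every year. If a month contains only NA values, or has no values at all, its monthly average should be NA.
For example: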
Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
We expect:
Name Date C1
A 2008-01 (100+120)/2 = 110
For example:
Name Date C1
A 2008-01-03 NA
A 2008-01-05 NA
A 2008-01-07 NA
We expect:
Name Date C1
A 2008-01 NA
For example:
Name Date C1
A 2008-01-03 100
A 2008-01-05 50
A 2008-01-07 120
We expect:
Name Date C1
A 2008-01 (100+50+120)/3 = 90
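Since I am fairly new to this, I don't know how to approach the problem. I am hoping someone can solve it and show me how this kind of thing is done. I would really appreciate your help :)
Answer 0 (score: 2)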
library(dplyr)

# Generate sample data
data <- data.frame(
  Name = c(rep("A", 25), rep("B", 50)),
  Date = seq(as.Date("2018-01-01"), as.Date("2020-01-12"), by = 10),
  C1   = rep(c(100, NA, NA, NA, NA, 500, 320, 102, 412, NA, 200, NA, 145, 800, 230), 5)
)

# Group by Name and month, then average C1 while ignoring NAs.
# mean() of an all-NA group returns NaN, so convert NaN to NA afterwards.
data %>%
  group_by(Name, month = cut(Date, "month")) %>%
  summarise(C1 = mean(C1, na.rm = TRUE)) %>%
  mutate(C1 = ifelse(is.nan(C1), NA, C1))
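The question also asks for a row for every month in the window, including months with no observations at all, which the grouping above does not produce on its own. One possible way to add them is `tidyr::complete()`. This is only a sketch: `tidyr` is not used in the answer above, and the `obs` sample, the `all_months` vector, and the shortened four-month window are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample in the same shape as the question's data
obs <- data.frame(
  Name = c("A", "A", "A", "A", "B"),
  Date = as.Date(c("2008-01-03", "2008-01-05", "2008-01-07",
                   "2008-03-10", "2008-02-03")),
  C1   = c(100, NA, 120, 50, NA)
)

# Every month that should appear in the output (shortened window, an assumption)
all_months <- seq(as.Date("2008-01-01"), as.Date("2008-04-01"), by = "month")

monthly <- obs %>%
  # Represent each month by its first day so it can be matched against all_months
  mutate(Month = as.Date(format(Date, "%Y-%m-01"))) %>%
  group_by(Name, Month) %>%
  # Return NA directly when a group has no non-missing values
  summarise(C1 = if (all(is.na(C1))) NA_real_ else mean(C1, na.rm = TRUE),
            .groups = "drop") %>%
  # Add a row for every Name/month combination that is absent; C1 is filled with NA
  complete(Name, Month = all_months) %>%
  mutate(Month = format(Month, "%Y-%m"))

print(monthly)
```

For the real data, `all_months` would presumably run from 2000-01-01 to 2012-12-01 to match the window in the question.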
Answer 1 (score: 1)
You can use dplyr::summarise_all to calculate the mean of all columns (C1, C2, etc.) at once. First group_by on Name and YearMon, deselect the Date column, and then call summarise_all.
library(dplyr)
library(lubridate)

# Added C2 to demonstrate calculation for multiple columns in one go.
df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  select(-Date) %>%
  summarise_all("mean", na.rm = TRUE)

# OR: use summarise_at and calculate the mean for all columns starting with 'C'
df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  summarise_at(vars(starts_with("C")), mean, na.rm = TRUE)
# A tibble: 12 x 4
# Groups: Name [?]
Name YearMon C1 C2
<chr> <chr> <dbl> <dbl>
1 A 2008-01 110 220
2 A 2008-02 NaN NaN
3 A 2008-03 50.0 100
4 A 2008-07 70.0 140
5 A 2009-01 40.0 80.0
6 A 2010-01 NaN NaN
7 A 2010-03 25.0 50.0
8 A 2011-07 10.0 20.0
9 B 2008-01 4.00 8.00
10 B 2008-02 11.0 22.0
11 B 2008-03 13.0 26.0
12 B 2008-07 NaN NaN
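The output above shows NaN for months whose values are all NA, while the question asks for NA. A minimal sketch of one way to convert them across every C column is given here. It assumes dplyr 1.0 or later for `across()`, which is not what the answer above uses, and `df_small` and `monthly_na` are hypothetical names introduced only for this example.

```r
library(dplyr)

# Hypothetical sample: 2008-02 contains only NA
df_small <- data.frame(
  Name = c("A", "A", "A", "A"),
  Date = as.Date(c("2008-01-03", "2008-01-07", "2008-02-03", "2008-03-10")),
  C1   = c(100, 120, NA, 50)
)

monthly_na <- df_small %>%
  mutate(C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  # Mean of every column whose name starts with "C", ignoring NAs
  summarise(across(starts_with("C"), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop") %>%
  # mean() of an all-NA group is NaN; replace those with NA
  mutate(across(starts_with("C"), ~ replace(.x, is.nan(.x), NA)))

print(monthly_na)
```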
Data:
df <- read.table(text =
"Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 NA",
header = TRUE, stringsAsFactors = FALSE)