Share of observations per year out of total observations

Time: 2019-12-20 12:56:55

Tags: r aggregate

I have the following data frame in R:

    Year   ID
1   2018   x
2   2018   x
3   2018   y
4   2018   z
5   2019   x
6   2019   x
7   2019   z     

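I want to calculate, separately for each year, the share of "x" among all observations in the "ID" column.

The result should look like this: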

Year   Share of x
2018   50 %
2019   67 %

Is it possible to do this with aggregate, along the lines of

aggregate(length(which(df$ID == x)) / length(df$ID), by=Year)

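or with any other function?

4 answers:

Answer 0: (score: 1)

Assuming the data shown reproducibly in the Note at the end, use table to compute the counts and then prop.table to compute each count's proportion of its row total.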

prop.table(table(dat), 1)
##       ID
## Year           x         y         z
##   2018 0.5000000 0.2500000 0.2500000
##   2019 0.6666667 0.0000000 0.3333333
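If only the share of x is needed, as in the desired output in the question, the matching column can be pulled out of the row-proportion table. The following is a minimal sketch under the assumption that the data look like the example; the names shares and result are introduced here purely for illustration.

```r
# Rebuild the example data so the snippet runs on its own
dat <- data.frame(
  Year = c(2018, 2018, 2018, 2018, 2019, 2019, 2019),
  ID   = c("x", "x", "y", "z", "x", "x", "z")
)

# Row-wise proportions: each row (one year) sums to 1
shares <- prop.table(table(dat), 1)

# Keep only the "x" column and express it as a rounded percentage
result <- data.frame(
  Year       = as.integer(rownames(shares)),
  share_of_x = round(100 * shares[, "x"])
)
result
```

Rounding is done only at the end, so the underlying proportions stay exact until they are displayed.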

Or, if you want the proportions within each column instead:

prop.table(table(dat), 2)
##       ID
## Year     x   y   z
##   2018 0.5 1.0 0.5
##   2019 0.5 0.0 0.5

aggregate

Regarding the aggregate tag on the question, the first case can be done like this:

aggregate(ID ~ Year, dat, 
  function(id) sapply(unique(dat$ID), function(x) setNames(mean(id == x), x)))
##   Year      ID.x      ID.y      ID.z
## 1 2018 0.5000000 0.2500000 0.2500000
## 2 2019 0.6666667 0.0000000 0.3333333

Or using aggregate and table together:

aggregate(ID ~ Year, dat, function(x) table(x) / length(x))
##   Year      ID.x ID.y      ID.z
## 1 2018 0.5000000 0.25 0.2500000
## 2 2019 0.6666667 0.00 0.3333333

dplyr / tidyr

library(dplyr)
library(tidyr)

dat %>%
  count(Year, ID) %>%
  group_by(Year) %>%
  mutate(prop = n / sum(n)) %>%
  pivot_wider(id_cols = Year, names_from = "ID", values_from = "prop", values_fill = list(prop = 0))

## # A tibble: 2 x 4
## # Groups:   Year [2]
##    Year     x     y     z
##   <int> <dbl> <dbl> <dbl>
## 1  2018 0.5    0.25 0.25 
## 2  2019 0.667  0    0.333

Note

Lines <- "    Year   ID
1   2018   x
2   2018   x
3   2018   y
4   2018   z
5   2019   x
6   2019   x
7   2019   z     "
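# Read ID as a factor so every level is counted in each year
# (since R 4.0.0 read.table returns character columns by default)
dat <- read.table(text = Lines, stringsAsFactors = TRUE)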

Answer 1: (score: 0)

Maybe you want to do this:

dfout<- setNames(aggregate(ID~Year,df,function(v) sum(v=="x")/length(v)*100),
                 c("Year","Share of x"))

which gives

> dfout
  Year Share of x
1 2018   50.00000
2 2019   66.66667

Data

df <-structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2019L, 2019L, 
2019L), ID = c("x", "x", "y", "z", "x", "x", "z")), class = "data.frame", row.names = c(NA, 
-7L))
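The same share can also be obtained without aggregate, because the mean of a logical vector equals sum(v == "x") / length(v). The sketch below uses tapply and formats the result in the "50 %" style shown in the question; the names share_x and share_pct are hypothetical and only used for illustration.

```r
# Same example data as above, rebuilt so the snippet runs on its own
df <- data.frame(
  Year = c(2018L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L),
  ID   = c("x", "x", "y", "z", "x", "x", "z"),
  stringsAsFactors = FALSE
)

# mean() of a logical vector is the fraction of TRUE values,
# so this is the share of "x" within each year
share_x <- tapply(df$ID == "x", df$Year, mean)

# Format as whole-number percentage strings
share_pct <- sprintf("%.0f %%", 100 * as.vector(share_x))

data.frame(Year = names(share_x), share = share_pct,
           stringsAsFactors = FALSE)
```

Keeping the numeric vector share_x separate from the formatted strings makes it possible to reuse the unrounded values later.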

Answer 2: (score: 0)

A tidyverse approach:

library(tidyverse)

data<- tribble(~year,~id,
               2018,"x",
               2018,"x",
               2018,"y",
               2018,"z",
               2019,"x",
               2019,"x",
               2019,"z"

)


agg <- data %>%
  group_by(year, id) %>%
  summarise(cnt_id = n()) %>%          # count each id per year
  group_by(year) %>%
  mutate(cnt_obs = sum(cnt_id),        # count total obs per year
         share = cnt_id / cnt_obs) %>%
  filter(id == "x") %>%
  select(year, id, share)
head(agg)
   year id    share
  <dbl> <chr> <dbl>
1  2018 x     0.5  
2  2019 x     0.667
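When only one ID value is of interest, the count-then-divide steps can be collapsed into a single summarise call, since mean(id == "x") is already the share per group. This is a sketch that assumes dplyr is installed; agg_short is a hypothetical name used only here.

```r
library(dplyr)

# Same example data, rebuilt so the snippet runs on its own
data <- data.frame(
  year = c(2018, 2018, 2018, 2018, 2019, 2019, 2019),
  id   = c("x", "x", "y", "z", "x", "x", "z")
)

# One row per year; share is the fraction of rows with id == "x"
agg_short <- data %>%
  group_by(year) %>%
  summarise(share = mean(id == "x"))

agg_short
```

A year that contains no "x" at all gets a share of 0 here, whereas it is dropped by the filter step in the longer pipeline.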

Answer 3: (score: 0)

I think y is missing for 2019, but still:

library(tidyverse)

df<- tribble(~year,~id,
               2018,"x",
               2018,"x",
               2018,"y",
               2018,"z",
               2019,"x",
               2019,"x",
               2019,"z"

)

df %>% 
  group_by(year,id) %>% 
  tally() %>% 
  group_by(year) %>% 
  mutate(prop = n/sum(n)) %>% 
  ungroup() %>% 
  select(-n) %>% 
  pivot_wider(names_from = id,values_from = prop) %>% 
  mutate_all(~ replace_na(.,replace = 0))