每组创建频率表

时间:2017-11-24 08:45:10

标签: r

在R中,我有一个data.frame,其中包含ID,句点和一些年份给出的变量。

df

df <- structure(list(id = 10001:10010, 
                     period = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), 
                                        .Label = c("2017-01", "2017-02"), class = "factor"), 
                     year1 = c(2013, 2014, 2014, 2017, 1970, 2014, 1970, 2014, 2014, 1970), 
                     year2 = c(2015, NA, 2015, 2015, 1970, 1970, 2014, 2017, 2014, 2017), 
                     year3 = c(2015, 2014, 2015, NA, NA, 2014, 2014, 2015, 1970, 2014), 
                     year4 = c(1970, 2014, 2013, 2014, 1970, 1970, 2014, 2015, 2015, NA)), 
                .Names = c("id", "period", "year1", "year2", "year3", "year4"), 
                row.names = c(NA, -10L), 
                class = "data.frame")

我想要实现的是创建一个data.frame,其中,对于每个时段和每个变量(year1-year4),频率显示特定年份(1970,2013-2017)的显示频率。 期望的结果应该是这样的:

result

result <- structure(list(period = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), 
                                            .Label = c("2017-01", "2017-02"), 
                                            class = "factor"), 
                         category = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), 
                                              .Label = c("year1", "year2", "year3", "year4"), 
                                              class = "factor"), 
                         X1970 = c(1L, 1L, 0L, 2L, 2L, 1L, 1L, 1L), 
                         X2013 = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), 
                         X2014 = c(3L, 0L, 2L, 1L, 2L, 2L, 2L, 2L), 
                         X2015 = c(0L, 3L, 3L, 1L, 0L, 0L, 0L, 1L), 
                         X2017 = c(1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L)), 
                    .Names = c("period", "category", "X1970", "X2013", "X2014", "X2015", "X2017"), 
                    class = "data.frame", 
                    row.names = c(NA, -8L))

我有R的中等经验,我相信这可以实现。但我完全迷失了如何处理这个问题。建议/想法?

2 个答案:

答案 0 :(得分:4)

以下是使用tidyverse

的想法
library(tidyverse)

df %>% 
 gather(category, val, -c(id, period)) %>% 
 na.omit() %>% 
 group_by(period, category, val) %>% 
 summarise(new = n()) %>% 
 spread(val, new, fill = 0)

给出,

# A tibble: 8 x 7
# Groups:   period, category [8]
   period category `1970` `2013` `2014` `2015` `2017`
*  <fctr>    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 2017-01    year1      1      1      3      0      1
2 2017-01    year2      1      0      0      3      2
3 2017-01    year3      0      0      2      3      0
4 2017-01    year4      2      1      1      1      0
5 2017-02    year1      2      0      2      0      0
6 2017-02    year2      1      0      2      0      0
7 2017-02    year3      1      0      2      0      0
8 2017-02    year4      1      0      2      1      0

这里的主要技巧是转换为长格式,计算频率,然后转换回宽格式。

使用count函数(如@markdly提及)的更短版本可以是,

df %>% 
 gather(category, val, -c(id, period)) %>% 
 na.omit() %>% 
 count(period, category, val) %>% 
 spread(val, n, fill = 0)

答案 1 :(得分:3)

如果我正确理解了您的问题,您可以使用包reshape2实现此目的,首先通过melt将数据从“宽”格式转换为“长”格式,然后通过{{1}将其转回在选择所需的类别时 - 我希望这就是你所需要的。

cast