How to summarize data by factor level in R

Time: 2017-01-11 08:27:11

Tags: r data-manipulation

I have the following data, and I want to summarize it (min/max/mean/median/mode/sd) by the levels of the factor cluster.kmeans:

head(MS.DATA.IMPVAR.KMEANS,10)
    subscribers      arpu handset3g       mou rechargesum cluster.kmeans
 1       105822 197704.10     19040 2854801.0      235430              5
 2        18210  34799.21      2856  419109.0       39820              6
 3        71351 133842.38     13056 2021183.0      157099              3
 4        44975 104681.58      9439 1303220.6      121697              2
 5        75860 133190.55     12605 1714640.8      144262              5
 6        63740 119389.91     11067 1651303.2      143333              1
 7        59368 117792.03     11747 1690910.7      136902              5
 8        40064  80427.09      7217  886214.5       89226              2
 9        51966  99385.52      9972 1407985.7      117353              5
 10       70811 141131.66     12362 1373104.7      158206              4

I tried using dplyr and got the following:

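library(dplyr)
library(tidyr)
# Compute every statistic for every column within each cluster
s_kmeans <- MS.DATA.IMPVAR.KMEANS %>% group_by(cluster.kmeans) %>% summarise_all(c("mean", "median", "min", "max", "sd"))
# Reshape to long format, then split keys such as "arpu_mean" into variable and statistic
s_kmeans <- gather(s_kmeans, key, value, -cluster.kmeans)
s_kmeans$variable <- sapply(strsplit(s_kmeans$key, "_"), `[`, 1)
s_kmeans$stat <- sapply(strsplit(s_kmeans$key, "_"), `[`, 2)
# One row per cluster and variable, one column per statistic
MS.DATA.STATS.KMEANS <- select(s_kmeans, -key) %>% spread(key = stat, value = value)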

head(MS.DATA.STATS.KMEANS)
 # A tibble: 6 × 7
   cluster.kmeans    variable       max      mean    median       min
           <fctr>       <chr>     <dbl>     <dbl>     <dbl>     <dbl>
 1              1        arpu  250153.5 164652.99 163718.33  88306.53
 2              1   handset3g   21809.0  13736.38  13598.00   6936.00
 3              1         mou 1143639.1 338834.54 313010.20 116523.59
 4              1 rechargesum  270169.0 173397.03 171897.00  89080.00
 5              1 subscribers   41428.0  26515.01  26321.00  13794.00
 6              2        arpu  163566.9  84552.09  82402.23  29477.03

I would like to do this another way, with fewer lines of code and without dplyr, using base R functions such as by, aggregate, etc.

1 answer:

Answer 0: (score: 3)

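It is not clear whether the goal is fewer lines of code or a base R solution. However, staying with the current Hadleyverse approach, we can put everything in a single %>% chain and use separate in place of the two sapply steps to make it more compact: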

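library(dplyr)
library(tidyr)
MS.DATA.IMPVAR.KMEANS %>%
    group_by(cluster.kmeans) %>%
    # funs() is deprecated since dplyr 0.8.0; newer versions accept
    # list(mean = mean, median = median, min = min, max = max, sd = sd) here
    summarise_all(funs(mean, median, min, max, sd)) %>%
    gather(key, value, -cluster.kmeans) %>%
    # Split keys such as "arpu_mean" into the variable and the statistic
    separate(key, into = c("variable", "stats")) %>%
    spread(stats, value)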
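
If base R is what is wanted, one possible sketch uses aggregate with a function that returns a named vector. This is an illustration under assumptions rather than a tested replacement for the pipeline above: df is a stand-in built from two columns of the ten rows shown in the question, and stat_mode and stats are hypothetical helper names. The mode helper is needed because base R's mode() returns the storage mode, not the most frequent value.

```r
# Stand-in for MS.DATA.IMPVAR.KMEANS: two columns from the ten rows in the question
df <- data.frame(
  subscribers = c(105822, 18210, 71351, 44975, 75860, 63740, 59368, 40064, 51966, 70811),
  arpu = c(197704.10, 34799.21, 133842.38, 104681.58, 133190.55,
           119389.91, 117792.03, 80427.09, 99385.52, 141131.66),
  cluster.kmeans = factor(c(5, 6, 3, 2, 5, 1, 5, 2, 5, 4))
)

# Hypothetical helper: most frequent value (first one wins on ties)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: all statistics for one column as a named vector
stats <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x),
    sd = sd(x), mode = stat_mode(x))
}

# One row per cluster; each variable becomes a matrix column of statistics
res <- aggregate(. ~ cluster.kmeans, data = df, FUN = stats)

# Flatten the matrix columns into ordinary columns such as subscribers.mean
flat <- do.call(data.frame, res)
print(flat)
```

The result is wide (one row per cluster, one column per variable and statistic) rather than the long layout shown in the question. In this ten-row sample, sd is NA for any cluster that has only one row, and the mode is of limited use for continuous columns where every value is unique.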