I have the following data, and I want to summarize it (min/max/mean/median/mode/sd) by the levels of the factor column cluster.kmeans:
head(MS.DATA.IMPVAR.KMEANS,10)
subscribers arpu handset3g mou rechargesum cluster.kmeans
1 105822 197704.10 19040 2854801.0 235430 5
2 18210 34799.21 2856 419109.0 39820 6
3 71351 133842.38 13056 2021183.0 157099 3
4 44975 104681.58 9439 1303220.6 121697 2
5 75860 133190.55 12605 1714640.8 144262 5
6 63740 119389.91 11067 1651303.2 143333 1
7 59368 117792.03 11747 1690910.7 136902 5
8 40064 80427.09 7217 886214.5 89226 2
9 51966 99385.52 9972 1407985.7 117353 5
10 70811 141131.66 12362 1373104.7 158206 4
I tried using dplyr and got the following:
s_kmeans <- MS.DATA.IMPVAR.KMEANS %>% group_by(cluster.kmeans) %>% summarise_all(c("mean", "median", "min", "max", "sd"))
s_kmeans <- gather(s_kmeans, key, value, -cluster.kmeans)
s_kmeans$variable <- sapply(strsplit(s_kmeans$key, "_"), `[`,1)
s_kmeans$stat <- sapply(strsplit(s_kmeans$key, "_"), `[`, 2)
MS.DATA.STATS.KMEANS <- select(s_kmeans, -key) %>% spread(key = stat, value = value)
head(MS.DATA.STATS.KMEANS)
# A tibble: 6 × 7
cluster.kmeans variable max mean median min
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 arpu 250153.5 164652.99 163718.33 88306.53
2 1 handset3g 21809.0 13736.38 13598.00 6936.00
3 1 mou 1143639.1 338834.54 313010.20 116523.59
4 1 rechargesum 270169.0 173397.03 171897.00 89080.00
5 1 subscribers 41428.0 26515.01 26321.00 13794.00
6 2 arpu 163566.9 84552.09 82402.23 29477.03
I would like to do this another way, with fewer lines of code and without dplyr, using base R functions such as by, aggregate, etc.
Answer 0 (score: 3)
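It is not clear whether the goal is fewer lines of code or base R. However, with the current Hadleyverse (tidyverse) style, we can put the whole thing in a single %>% chain and use separate instead of the two sapply steps, which makes it more compact: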
library(dplyr)
library(tidyr)
MS.DATA.IMPVAR.KMEANS %>%
group_by(cluster.kmeans) %>%
summarise_all(list(mean = mean, median = median, min = min, max = max, sd = sd)) %>%  # funs() is deprecated since dplyr 0.8.0
gather(key, value, -cluster.kmeans) %>%
separate(key, into = c("variable", "stats")) %>%
spread(stats, value)
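
Since the question also asks about base R, here is a minimal sketch of one way to do it with aggregate. It is not part of the original answer. The data frame df is a hypothetical stand-in built from the 10 rows shown in the question, and stats is a helper name made up for this example. Mode is left out, as in the dplyr code above, because base R has no built-in function for the statistical mode.

```r
# Hypothetical stand-in for MS.DATA.IMPVAR.KMEANS, built from the head() output in the question
df <- data.frame(
  subscribers    = c(105822, 18210, 71351, 44975, 75860, 63740, 59368, 40064, 51966, 70811),
  arpu           = c(197704.10, 34799.21, 133842.38, 104681.58, 133190.55,
                     119389.91, 117792.03, 80427.09, 99385.52, 141131.66),
  handset3g      = c(19040, 2856, 13056, 9439, 12605, 11067, 11747, 7217, 9972, 12362),
  mou            = c(2854801.0, 419109.0, 2021183.0, 1303220.6, 1714640.8,
                     1651303.2, 1690910.7, 886214.5, 1407985.7, 1373104.7),
  rechargesum    = c(235430, 39820, 157099, 121697, 144262, 143333, 136902, 89226, 117353, 158206),
  cluster.kmeans = factor(c(5, 6, 3, 2, 5, 1, 5, 2, 5, 4))
)

# Helper (name is an assumption): returns a named vector of summary statistics
stats <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
}

# Apply stats() to every other column within each level of cluster.kmeans
res <- aggregate(. ~ cluster.kmeans, data = df, FUN = stats)

# aggregate() stores each variable's statistics as a matrix column;
# this flattens them into ordinary columns such as arpu.mean, arpu.sd
res <- do.call(data.frame, res)
res
```

The result is in wide format, with one row per cluster and one column per variable/statistic pair, rather than the long format produced by the gather/spread pipeline. In this small sample, sd is NA for clusters that contain a single row.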