I would like to categorize a numeric variable in my data.frame object using dplyr (and have no idea how to do it).
Without dplyr, I would probably do something like:
df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))
df$a <- cut(df$a , breaks=quantile(df$a, probs = seq(0, 1, 0.2)))
and it would be done. However, I would strongly prefer to do it with a dplyr function (mutate, I suppose) as part of a chain of other operations I run on my data.frame.
Answer 0 (score: 24)
library(dplyr)  # provides %>% and mutate()

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))
# bin `a` at its quintiles inside the pipe
df %>% mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2))))
which gives:
                 a          b
1  (-0.586,-0.316]  1.2240818
2   (-0.316,0.094]  0.3598138
3      (0.68,1.72]  0.4007715
4   (-0.316,0.094]  0.1106827
5     (0.094,0.68] -0.5558411
6      (0.68,1.72]  1.7869131
7     (0.094,0.68]  0.4978505
8             <NA> -1.9666172
9   (-1.27,-0.586]  0.7013559
10 (-0.586,-0.316] -0.4727914
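The <NA> in row 8 appears because cut() builds left-open intervals by default, so the minimum of a, which coincides with the lowest break, falls outside every bin. A minimal sketch of one way to avoid this, assuming the same data as above (binned is a name introduced here for illustration), is to pass include.lowest = TRUE:

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

# include.lowest = TRUE closes the first interval on the left,
# so the minimum of `a` lands in the first bin instead of becoming NA
binned <- df %>%
  mutate(a = cut(a,
                 breaks = quantile(a, probs = seq(0, 1, 0.2)),
                 include.lowest = TRUE))

# count the missing values left in the binned column
sum(is.na(binned$a))
```

The same argument works in the base R version from the question, since mutate() simply passes the column through to cut().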
Answer 1 (score: 8)
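The ggplot2 package has 3 functions that work well for this kind of task:
cut_number(): makes n groups with (approximately) equal numbers of observations
cut_interval(): makes n groups with equal range
cut_width(): makes groups of a given width
My go-to is cut_number(), because it bins observations using evenly spaced quantiles. Here is an example with skewed data.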
library(tidyverse)
skewed_tbl <- tibble(
  counts = c(1:100, 1:50, 1:20, rep(1:10, 3),
             rep(1:5, 5), rep(1:2, 10), rep(1, 20))
) %>%
  mutate(
    counts_cut_number   = cut_number(counts, n = 4),
    counts_cut_interval = cut_interval(counts, n = 4),
    counts_cut_width    = cut_width(counts, width = 25)
  )
# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]
#> # ... with 255 more rows
summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    1.00    3.00   13.00   25.75   42.00  100.00
# Histogram showing skew
skewed_tbl %>%
  ggplot(aes(counts)) +
  geom_histogram(bins = 30)
# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
  ggplot(aes(counts_cut_number)) +
  geom_bar()
# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
  ggplot(aes(counts_cut_interval)) +
  geom_bar()
# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
  ggplot(aes(counts_cut_width)) +
  geom_bar()
Created on 2018-11-01 by the reprex package (v0.2.1)
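To compare the three functions without plotting, the bin sizes can also be tabulated directly. The following is a minimal sketch, assuming the same counts vector as in the example; bin_sizes is a name introduced here for illustration and is not part of the original answer.

```r
library(ggplot2)

# same skewed vector as in the example above
counts <- c(1:100, 1:50, 1:20, rep(1:10, 3),
            rep(1:5, 5), rep(1:2, 10), rep(1, 20))

# number of observations per bin for each binning function
bin_sizes <- list(
  cut_number   = table(cut_number(counts, n = 4)),
  cut_interval = table(cut_interval(counts, n = 4)),
  cut_width    = table(cut_width(counts, width = 25))
)

bin_sizes
```

With data skewed like this, the cut_number() bins should come out roughly balanced, while cut_interval() and cut_width() place most observations in their first bin.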