Question

Using the diamonds data set as an example (for illustration only; this code does not run):

brks <- seq(0, 1, 0.1)   # use fraction as breaks: every top 10%
labs <- seq(10, 100, 10) # name of each label: top%

diamonds %>% 
    group_by(color) %>%
    mutate(bin = cut(diamonds$price, breaks = brks, labels = labs))

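I want to add a column that labels each row with its top-percentage price bracket within its color group.

The base R function cut does something similar, but cut needs explicit break values, whereas I want to label prices by top percentage.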

Answer 1

If you want to label deciles (rank groups, each containing 10% of the values) within each level of color, you can do this:

library(tidyverse)

diamonds = diamonds %>% 
  group_by(color) %>%
  mutate(bin = ntile(price, n=10))

Summarizing the bins:

diamonds %>% 
  group_by(color, bin) %>% 
  summarise(n = n(), 
            mean_price = mean(price))

   color bin    n mean_price
1      D   1  678   559.0310
2      D   2  677   736.2230
3      D   3  678   899.3142
4      D   4  677  1150.0842
...
57     I   7  542  5341.4004
58     I   8  542  7325.0886
59     I   9  542 10572.9207
60     I  10  542 15777.6697
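
To see how ntile assigns bins, here is a small sketch on a made-up vector (the values are an assumption for illustration, not taken from diamonds). ntile ranks the values in ascending order and splits the ranks into n buckets of as near equal size as possible, so bin 1 holds the lowest prices:

```r
library(dplyr)

# Hypothetical prices, chosen only for illustration
price <- c(5, 1, 3, 2, 4, 6)

# Three buckets of two values each; the two lowest values land in bucket 1
bins <- ntile(price, 3)
bins
# [1] 3 1 2 1 2 3
```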

If you want the bin labels to run from 10 to 100 instead of 1 to 10, multiply the label by 10:

  mutate(bin = 10 * ntile(price, n=10))
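
Note that ntile ranks in ascending order, so with the code above the highest bin holds the most expensive diamonds. If the label should instead read as "top x%", with 10 meaning the most expensive tenth, one possible variation is to rank on desc(price). This is an assumption about what the question intends, not part of the original answer, and the names diamonds_top and top_pct are introduced here only for illustration:

```r
library(tidyverse)

diamonds_top <- ggplot2::diamonds %>%
  group_by(color) %>%
  # desc() reverses the ranking, so the highest prices fall in the first
  # bucket, which is then labelled 10 (read as "top 10%")
  mutate(top_pct = 10 * ntile(desc(price), n = 10)) %>%
  ungroup()
```

With this variation, a row labelled 10 is among the most expensive tenth of its color group and a row labelled 100 is among the cheapest tenth.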

To answer the follow-up question in the comments, here is one option. We split the data by color so that we can cut by quantiles within each level of color.

diamonds = diamonds %>% 
  split(diamonds$color) %>% 
  map_df(~ .x %>% 
           mutate(price.bins.by.color = cut(price, breaks=quantile(price, probs=c(0, 0.05, 0.2, 0.5, 1)),
                                            labels=c("0%-5%", "5%-20%", "20%-50%", "50%-100%"), include.lowest=TRUE))
  )

diamonds %>% 
  group_by(color, price.bins.by.color) %>% 
  summarise(n = n(),
            mean_price=mean(price)) %>% 
  filter(price.bins.by.color=="20%-50%")

  color price.bins.by.color     n mean_price
1 D     20%-50%              2021      1233.
2 E     20%-50%              2938      1167.
3 F     20%-50%              2848      1448.
4 G     20%-50%              3386      1372.
5 H     20%-50%              2487      1863.
6 I     20%-50%              1626      2163.
7 J     20%-50%               843      2749.
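
The same bins can probably also be produced without split and map_df, because mutate on a grouped data frame evaluates quantile(price) separately for each color. A minimal sketch, assuming the quantile breaks are distinct within every group (cut stops with an error on duplicated breaks); the names probs, labs and binned are introduced here only for illustration:

```r
library(tidyverse)

probs <- c(0, 0.05, 0.2, 0.5, 1)
labs  <- c("0%-5%", "5%-20%", "20%-50%", "50%-100%")

binned <- ggplot2::diamonds %>%
  group_by(color) %>%
  # Inside a grouped mutate, price refers only to the current color's prices,
  # so the quantile breaks are computed per group
  mutate(price.bins.by.color = cut(price,
                                   breaks = quantile(price, probs = probs),
                                   labels = labs,
                                   include.lowest = TRUE)) %>%
  ungroup()
```

Unlike the split approach, this keeps the rows in their original order rather than regrouping them by color.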
