Question

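I have an Excel file with two columns: the first is the customer and the second is the revenue generated from that customer. Suppose my total revenue is 1000. I need to split this total into 5 buckets: 20% of total revenue (0-200), 40% (200-400), 60% (400-600), 80% (600-800), and 100% (800-1000). I want to count the number of customers in each bucket range, e.g. how many customers fall in the range where the sum of revenue is less than 20% of total revenue, and so on for the other ranges, and finally plot the counts as a bar chart. How can I do this in R? Here is the sample data: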

 Customer   Revenue
    a          230
    b          170
    c          809
    d          435
    e          678
    f          350
    g          465
    h          990
    i          767
    j          500

Answer 1

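Load the packages first (tibble() comes from dplyr, gather() from tidyr), then create the raw data:

library(dplyr)
library(tidyr)
library(ggplot2)

df <- tibble(Customer = letters[1:10], Revenue = c(230, 170, 809, 435, 678, 350, 465, 990, 767, 500))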

Sort the data.frame by increasing revenue:

df <- df %>% 
  arrange(Revenue)

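Use the cut() function from base R to add a variable showing which of the 5 bins each customer's cumulative revenue share falls into. Then plot the bins on the x-axis and the number of customers in each bin on the y-axis.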

    df %>% 
      mutate(Revenue_Cumulated = cumsum(Revenue)/sum(Revenue)) %>% 
      mutate(bins = cut(Revenue_Cumulated, breaks = seq(0, 1, 0.2))) %>% 
      group_by(bins) %>% 
      summarise(n = n()) %>% 
      mutate(cumulated_n = cumsum(n)) %>% 

    # data.frame at that point in the code:
    # A tibble: 5 x 3
    #  bins          n cumulated_n
    #  <fct>     <int>       <int>
    # 1 (0,0.2]       3           3
    # 2 (0.2,0.4]     3           6
    # 3 (0.4,0.6]     1           7
    # 4 (0.6,0.8]     1           8
    # 5 (0.8,1]       2          10

      gather(key, value, -bins) %>% 
      ggplot(aes(x = bins, y = value, fill = key)) +
      geom_col(position = "dodge") +
      geom_text(aes(label = value), position = position_dodge(width = 0.9), vjust = -0.25)

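cumulated_n now tells you how many customers contribute the first 0-X percent of revenue. The gather() function reshapes the data into long format, which makes it easier to treat "n" and "cumulated_n" as levels of one factor, so the two counts appear as side-by-side bars in the plot.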

[Figure: number of customers by bin]
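
As a cross-check on the pipeline above, the same bin counts can be reproduced with base R alone. This is a minimal sketch, not part of the original answer; the variable names `revenue`, `cum_share`, and `bins` are chosen here for illustration.

```r
# Sample data from the question
revenue <- c(230, 170, 809, 435, 678, 350, 465, 990, 767, 500)

# Cumulative share of total revenue, smallest customers first
cum_share <- cumsum(sort(revenue)) / sum(revenue)

# Assign each customer to a 20%-wide bin; intervals are right-closed, e.g. (0,0.2]
bins <- cut(cum_share, breaks = seq(0, 1, 0.2))

# Customers per bin and the running total across bins
n <- table(bins)
cumulated_n <- cumsum(n)

print(n)
print(cumulated_n)
```

The counts per bin should match the `n` column of the tibble shown in the answer, and `cumulated_n` its running total.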

Answer 2

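You can plot a histogram of revenue directly, and R will do the binning for you, counting the customers whose individual revenue falls in each 200-wide range: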

Revenue <- c(230, 170, 809, 435, 678, 350, 465, 990, 767, 500)
hist(Revenue, breaks = seq(0, 1000, 200))
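
To inspect the counts behind the histogram without drawing it, `hist()` can be called with `plot = FALSE`. The sketch below is an illustration added here, with `h` as an assumed variable name. Note that this bins each customer's own revenue, not the cumulative share used in Answer 1, so the counts differ between the two answers.

```r
Revenue <- c(230, 170, 809, 435, 678, 350, 465, 990, 767, 500)

# Compute the histogram object without plotting it
h <- hist(Revenue, breaks = seq(0, 1000, 200), plot = FALSE)

# Number of customers whose revenue falls in each 200-wide range
print(h$breaks)
print(h$counts)
```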
