Creating ranges from cumulative values

Date: 2019-11-22 15:56:12

Tags: r

I have DF1:

KEY <- c(11,12,22,33,44,55,66,77,88,99,1010,1111,1212,1313,1414,1515,1616,1717,1818,1919,2020)
PRICE <- c(0,0,1,5,7,10,20,80,110,111,200,1000,2500,2799,3215,4999,7896,8968,58914,78422,96352)
DF1 <- data.frame(KEY,PRICE)

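I want to group DF1 into several ranges that accumulate the values of both columns (counting the KEY column and summing the PRICE column). This is the result I am hoping for: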

INTERVAL <-c('0','UP_TO_10','UP_TO_100','UP_TO_1000','UP_TO_5000','UP_TO_10000','UP_TO_100000')
COUNT_KEY <-c(2,6,8,12,16,18,21)
SUM_PRICE <- c(0,23,123,1544,15057,31921,265609)
DF2 <- data.frame(INTERVAL,COUNT_KEY,SUM_PRICE)

How can I build this table?

3 answers:

Answer 0 (score: 2)

If you have a vector of limits or thresholds, for example:

LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)

you can get the number of rows whose PRICE is at or below each limit:

unlist(lapply(LIMITS, function(x) sum(DF1$PRICE <= x)))
[1]  2  6  8 12 16 18 21

and the sum of those prices:

unlist(lapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x])))
[1]      0     23    123   1544  15057  31921 265609

Is this what you had in mind?
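Both one-liners rely on logical indexing. A minimal sketch of the mechanics, using a shortened price vector made up for illustration (the names small_price and at_or_below are not part of the original answer):

```r
# Hypothetical shortened vector, for illustration only
small_price <- c(0, 0, 1, 5, 7, 10, 20, 80)

# The comparison yields one TRUE/FALSE per element
at_or_below <- small_price <= 10

# sum() treats TRUE as 1 and FALSE as 0, so this counts the matching rows
n_rows <- sum(at_or_below)

# Subsetting with the logical vector keeps only the matching prices
total <- sum(small_price[at_or_below])

print(n_rows)  # 6
print(total)   # 23
```

Wrapping these two expressions in lapply over LIMITS repeats the same computation once per threshold.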

Putting it all together:

LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)
COUNT_KEY <- unlist(lapply(LIMITS, function(x) sum(DF1$PRICE <= x)))
SUM_PRICE <- unlist(lapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x])))
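# format(..., scientific = FALSE) keeps 100000 from being pasted as "1e+05"
data.frame(INTERVAL = c("0", paste("UP_TO", format(LIMITS[-1], scientific = FALSE, trim = TRUE), sep = "_")), COUNT_KEY, SUM_PRICE)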

      INTERVAL COUNT_KEY SUM_PRICE
1            0         2         0
2     UP_TO_10         6        23
3    UP_TO_100         8       123
4   UP_TO_1000        12      1544
5   UP_TO_5000        16     15057
6  UP_TO_10000        18     31921
7 UP_TO_100000        21    265609
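The approach above rescans PRICE once per limit. One possible alternative, which is not part of the original answer, sorts the prices once and reads both the counts and the sums off a running total with base R's findInterval and cumsum. The names sorted_price and running_sum are introduced here for illustration only.

```r
KEY <- c(11,12,22,33,44,55,66,77,88,99,1010,1111,1212,1313,1414,1515,1616,1717,1818,1919,2020)
PRICE <- c(0,0,1,5,7,10,20,80,110,111,200,1000,2500,2799,3215,4999,7896,8968,58914,78422,96352)
DF1 <- data.frame(KEY, PRICE)
LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)

# Sort once so that "at or below a limit" becomes a prefix of the vector
sorted_price <- sort(DF1$PRICE)

# For each limit, the number of sorted prices that are <= that limit
COUNT_KEY <- findInterval(LIMITS, sorted_price)

# running_sum[k + 1] is the sum of the k smallest prices (running_sum[1] is 0)
running_sum <- c(0, cumsum(sorted_price))
SUM_PRICE <- running_sum[COUNT_KEY + 1]

# scientific = FALSE avoids a "1e+05" label for the largest limit
DF2 <- data.frame(
  INTERVAL = c("0", paste("UP_TO", format(LIMITS[-1], scientific = FALSE, trim = TRUE), sep = "_")),
  COUNT_KEY,
  SUM_PRICE
)
print(DF2)
```

For 21 rows the difference is negligible; the sketch only matters when PRICE or LIMITS is large.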

Answer 1 (score: 2)

You first have to define the boundaries manually:

X = c(-Inf,0,10,100,1000,5000,10000,100000)

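Then use cut to assign each entry its label (INTERVAL is the label vector defined in the question). We first summarise the count and the total price within each interval.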

library(dplyr)

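DF1 %>%
  # refer to the PRICE column directly inside mutate(); labels come from INTERVAL
  mutate(LABELS = cut(PRICE, breaks = X, labels = INTERVAL, include.lowest = TRUE)) %>%
  group_by(LABELS) %>%
  summarise(COUNT_KEY = n(), SUM_PRICE = sum(PRICE))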

# A tibble: 7 x 3
  LABELS       COUNT_KEY SUM_PRICE
  <fct>            <int>     <dbl>
1 0                    2         0
2 UP_TO_10             4        23
3 UP_TO_100            2       100
4 UP_TO_1000           4      1421
5 UP_TO_5000           4     13513
6 UP_TO_10000          2     16864
7 UP_TO_100000         3    233688

Apart from the fact that SUM_PRICE and the counts should be cumulative, this is close to what you want. That can be fixed by appending mutate_if(is.numeric, cumsum):

DF1 %>%
  mutate(LABELS = cut(PRICE, breaks = X, labels = INTERVAL, include.lowest = TRUE)) %>%
  group_by(LABELS) %>%
  summarise(COUNT_KEY = n(), SUM_PRICE = sum(PRICE)) %>%
  # turn the per-interval figures into running totals
  mutate_if(is.numeric, cumsum)

which gives:

# A tibble: 7 x 3
  LABELS       COUNT_KEY SUM_PRICE
  <fct>            <int>     <dbl>
1 0                    2         0
2 UP_TO_10             6        23
3 UP_TO_100            8       123
4 UP_TO_1000          12      1544
5 UP_TO_5000          16     15057
6 UP_TO_10000         18     31921
7 UP_TO_100000        21    265609
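The same bin-then-accumulate idea can be sketched in base R without dplyr. This is an illustrative variant rather than the answerer's code; per_bin_count and per_bin_sum are names introduced here, and the sketch assumes every interval contains at least one price (tapply returns NA for an empty bin).

```r
PRICE <- c(0,0,1,5,7,10,20,80,110,111,200,1000,2500,2799,3215,4999,7896,8968,58914,78422,96352)
X <- c(-Inf, 0, 10, 100, 1000, 5000, 10000, 100000)
INTERVAL <- c('0','UP_TO_10','UP_TO_100','UP_TO_1000','UP_TO_5000','UP_TO_10000','UP_TO_100000')

# cut() uses right-closed bins by default: (-Inf, 0], (0, 10], (10, 100], ...
LABELS <- cut(PRICE, breaks = X, labels = INTERVAL)

# Per-interval figures, in the order of the factor levels
per_bin_count <- as.vector(table(LABELS))
per_bin_sum <- as.vector(tapply(PRICE, LABELS, sum))

# Running totals give the cumulative table
DF2 <- data.frame(
  INTERVAL,
  COUNT_KEY = cumsum(per_bin_count),
  SUM_PRICE = cumsum(per_bin_sum)
)
print(DF2)
```

In dplyr 1.0.0 and later the scoped verb mutate_if is superseded; mutate(across(where(is.numeric), cumsum)) is the documented equivalent.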

Answer 2 (score: 1)

OK, here is an all-in-one tidy approach using dplyr ;)

library(dplyr)

DF1 %>%
  mutate(                                 
    INTERVAL =
      factor(
        case_when(                          # create discrete variable 
          PRICE == 0      ~ '0',
          PRICE <= 10     ~ 'UP_TO_10',
          PRICE <= 100    ~ 'UP_TO_100',
          PRICE <= 1000   ~ 'UP_TO_1000',
          PRICE <= 5000   ~ 'UP_TO_5000',
          PRICE <= 10000  ~ 'UP_TO_10000',
          PRICE <= 100000 ~ 'UP_TO_100000'
        ),
        levels =                            # set the factor levels
          c(
            '0',
            'UP_TO_10',
            'UP_TO_100',
            'UP_TO_1000',
            'UP_TO_5000',
            'UP_TO_10000',
            'UP_TO_100000'
            )
        )
  ) %>% 
  group_by(INTERVAL) %>%                    # create desired group
  summarise(                                # and summary variables
    COUNT_KEY = n(),
    SUM_PRICE = sum(PRICE)
  ) %>%
  mutate(                                   # cumulative totals
    COUNT_KEY_CUM = cumsum(COUNT_KEY),
    SUM_PRICE_CUM = cumsum(SUM_PRICE)
  )
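The explicit levels argument matters for the final cumsum step. A small base R sketch of why, assuming ordinary text collation (the variable names are illustrative only):

```r
labels <- c('0', 'UP_TO_10', 'UP_TO_100', 'UP_TO_1000',
            'UP_TO_5000', 'UP_TO_10000', 'UP_TO_100000')

# Without explicit levels, factor() sorts the labels as text
default_levels <- levels(factor(labels))

# With explicit levels, the order follows the thresholds
explicit_levels <- levels(factor(labels, levels = labels))

print(default_levels)
print(explicit_levels)
```

Sorted as text, 'UP_TO_5000' comes after 'UP_TO_100000', so the grouped rows would be out of threshold order and cumsum would accumulate them incorrectly. Note also that this answer keeps the per-interval figures in COUNT_KEY and SUM_PRICE and puts the running totals in the separate columns COUNT_KEY_CUM and SUM_PRICE_CUM.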