Determine the median of a variable over runs of consecutive 0s in another variable

Time: 2018-05-11 16:12:06

Tags: r

Here is what my data looks like.

Group,Sales,flag,Count
Paris,6738,0,15
Paris,5235,1,23
Paris,5907,1,15
Paris,5527,0,28
Paris,6934,1,27
Paris,6757,0,20
Paris,5394,1,31
Paris,5379,0,36
Paris,6266,1,40
Paris,5512,1,39
Paris,6506,1,29
Paris,5006,1,22
Paris,6465,1,17
Paris,6653,0,38
Paris,6719,0,12
New York,5333,1,19
New York,6763,1,37
New York,6468,0,32
New York,6923,0,34
New York,6705,0,16
New York,6542,0,11
New York,6497,0,19
New York,6616,0,27
New York,6788,0,26
New York,5876,1,33
New York,5382,0,40
New York,5688,0,34
New York,6667,1,20
New York,5929,1,28
New York,6096,0,30

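For each city, I want to compute the median Sales over each run of consecutive zeros in flag, that is, the zeros that come before and after each flag of 1.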

Here is the code I used, as suggested in the comments:

setDT(c)[, .(median(Sales), median(Count)), .(City, rleid(flag))][rleid %% 2 == 1, .(City, median = V1, count = V2)]

And below is the output I got from the suggested code:

head(d,20)
    City  median   count
1: Paris 6738.000 15.00000
2: Paris 5527.000 28.00000
3: Paris 6757.000 20.00000
4: Paris 5379.000 36.00000
5: Paris 6686.000 25.00000
6:    NY 6648.429 23.57143
7:    NY 5535.000 37.00000
8:    NY 6096.000 30.00000

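The expected output is attached below. There is a discrepancy for the New York group (median Sales and median Count):

R code output: row 6, NY - 6648.429 and Count - 23.57

Excel output: NY - 6616 and Count - 26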

[Image: expected output]

Thanks, 杰

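2 answers:

Answer 0 (score: 3)

You can use rleid from data.table to compute the mean for each City and each run of identical flag values (runs of 0s and runs of 1s), then keep only the runs where flag is 0. The code below keeps the odd-numbered runs, which are the 0 runs here because the data start with flag == 0.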

library(data.table)
setDT(data)[, .(mean(Sales), mean(Count)), .(City, rleid(flag))][rleid %% 2 == 1, .(City, average = V1, count = V2)]

    City  average
1: Paris 4000.000
2: Paris 3833.333
3:    NY 4500.000
4:    NY 3500.000

The output of data[, rleid(flag)] is:  [1] 1 1 1 2 3 3 3 4 5 5 6 7 7 8
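
The question asks for the median rather than the mean. The following is a minimal sketch of the same rleid idea with median(), run on the data posted in the question. The result column names (median_sales and so on) are illustrative, and flag is kept as a grouping column so the zero runs can be selected with flag == 0 instead of relying on odd run numbers. On that data the seven-row New York run has a median Sales of 6616 and a median Count of 26, which agrees with the Excel figures, while its mean is 6648.429 and 23.57, which are the figures in the question's R output. That suggests the output shown in the question came from mean() rather than median().

```r
library(data.table)

# Data copied from the question; the first column is named City here
# to match the code used in the question.
d <- read.csv(stringsAsFactors = FALSE, text = "City,Sales,flag,Count
Paris,6738,0,15
Paris,5235,1,23
Paris,5907,1,15
Paris,5527,0,28
Paris,6934,1,27
Paris,6757,0,20
Paris,5394,1,31
Paris,5379,0,36
Paris,6266,1,40
Paris,5512,1,39
Paris,6506,1,29
Paris,5006,1,22
Paris,6465,1,17
Paris,6653,0,38
Paris,6719,0,12
New York,5333,1,19
New York,6763,1,37
New York,6468,0,32
New York,6923,0,34
New York,6705,0,16
New York,6542,0,11
New York,6497,0,19
New York,6616,0,27
New York,6788,0,26
New York,5876,1,33
New York,5382,0,40
New York,5688,0,34
New York,6667,1,20
New York,5929,1,28
New York,6096,0,30")
setDT(d)

# One row per City and per run of identical flag values.
# as.numeric() keeps the result type the same in every group: median() on an
# integer vector returns an integer for odd lengths and a double for even ones.
runs <- d[, .(median_sales = median(as.numeric(Sales)),
              median_count = median(as.numeric(Count)),
              mean_sales   = mean(Sales),
              mean_count   = mean(Count)),
          by = .(City, run = rleid(flag), flag)]

# Keep only the runs of zeros.
zero_runs <- runs[flag == 0]
print(zero_runs)
```

Filtering on flag == 0 gives the same rows as rleid %% 2 == 1 for this data, but it does not depend on the first row having flag 0.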

Answer 1 (score: 3)

Base R

x <- read.csv(header=TRUE, stringsAsFactors=FALSE, text='
City, Sales, flag
Paris, 3000, 0
Paris, 4000, 0
Paris, 5000, 0
Paris, 3000, 1
Paris, 3000, 0
Paris, 4000, 0
Paris, 4500, 0
NY, 3000, 1
NY, 4000, 0
NY, 5000, 0
NY, 3000, 1
NY, 3000, 0
NY, 4000, 0
NY, 4500, 1')

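# cumsum(c(0, diff(flag) != 0)) is a run id that increases by one each time
# flag changes; each (City, run) block is collapsed to its first row with the
# block's mean Sales.
do.call(rbind,
        by(x, list(x$City, cumsum(c(0, diff(x$flag) != 0))),
           function(a) { a$Sales <- mean(a$Sales); a[1, , drop = FALSE] }))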
#     City    Sales flag
# 1  Paris 4000.000    0
# 4  Paris 3000.000    1
# 5  Paris 3833.333    0
# 8     NY 3000.000    1
# 9     NY 4500.000    0
# 11    NY 3000.000    1
# 12    NY 3500.000    0
# 14    NY 4500.000    1
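
The grouping key in the call above is the cumsum(c(0, diff(flag) != 0)) expression. As a small sketch of what it computes, the snippet below rebuilds it step by step on the flag and Sales values from the example data (the variable names changed, run_id and zero_medians are illustrative) and then takes the median of Sales over the zero runs only, since the question asks for the median.

```r
# flag and Sales columns from the example data above
flag  <- c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1)
sales <- c(3000, 4000, 5000, 3000, 3000, 4000, 4500,
           3000, 4000, 5000, 3000, 3000, 4000, 4500)

changed <- c(0, diff(flag) != 0)  # 1 where flag differs from the previous row
run_id  <- cumsum(changed)        # run counter, starting at 0
print(run_id)
# 0 0 0 1 2 2 2 3 4 4 5 6 6 7

# The same grouping via rle(): repeat each run index by the run's length.
r <- rle(flag)
run_id_rle <- rep(seq_along(r$lengths), r$lengths)
stopifnot(all(run_id + 1 == run_id_rle))

# Median Sales per run of zeros. City is left out of the key here because, in
# this example, every city boundary coincides with a flag change; in general
# City should be part of the grouping as in the by() call above.
zero <- flag == 0
zero_medians <- tapply(sales[zero], run_id[zero], median)
print(zero_medians)
```

Adding 1 to run_id gives 1 1 1 2 3 3 3 4 5 5 6 7 7 8, the same sequence that rleid(flag) returns in the data.table answer.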

dplyr

library(dplyr)
x %>%
  mutate(flaggroup = cumsum(c(0,diff(flag)!=0))) %>%
  group_by(City, flaggroup) %>%
  summarize(Sales = mean(Sales), flag = first(flag)) %>%
  ungroup() %>%
  select(-flaggroup)
# # A tibble: 8 × 3
#    City    Sales  flag
#   <chr>    <dbl> <int>
# 1    NY 3000.000     1
# 2    NY 4500.000     0
# 3    NY 3000.000     1
# 4    NY 3500.000     0
# 5    NY 4500.000     1
# 6 Paris 4000.000     0
# 7 Paris 3000.000     1
# 8 Paris 3833.333     0
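
Because the question asks for the median over the zero runs only, the pipeline above can be adapted as in the sketch below. It assumes dplyr 1.0.0 or later for the .groups argument, and the name zero_run_medians is illustrative. The run id is computed before filtering so that runs of zeros separated by a 1 stay distinct.

```r
library(dplyr)

# Same example data as above, re-created so the snippet runs on its own.
x <- read.csv(header = TRUE, stringsAsFactors = FALSE, text = "City,Sales,flag
Paris,3000,0
Paris,4000,0
Paris,5000,0
Paris,3000,1
Paris,3000,0
Paris,4000,0
Paris,4500,0
NY,3000,1
NY,4000,0
NY,5000,0
NY,3000,1
NY,3000,0
NY,4000,0
NY,4500,1")

zero_run_medians <- x %>%
  # run id: increases by one each time flag changes
  mutate(flaggroup = cumsum(c(0, diff(flag) != 0))) %>%
  # keep only the rows that belong to runs of zeros
  filter(flag == 0) %>%
  group_by(City, flaggroup) %>%
  # as.numeric() keeps the column type the same for odd- and even-length runs
  summarize(Sales = median(as.numeric(Sales)), .groups = "drop") %>%
  select(-flaggroup)

print(zero_run_medians)
```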