Here is what my data looks like.
Group,Sales,flag,Count
Paris,6738,0,15
Paris,5235,1,23
Paris,5907,1,15
Paris,5527,0,28
Paris,6934,1,27
Paris,6757,0,20
Paris,5394,1,31
Paris,5379,0,36
Paris,6266,1,40
Paris,5512,1,39
Paris,6506,1,29
Paris,5006,1,22
Paris,6465,1,17
Paris,6653,0,38
Paris,6719,0,12
New York,5333,1,19
New York,6763,1,37
New York,6468,0,32
New York,6923,0,34
New York,6705,0,16
New York,6542,0,11
New York,6497,0,19
New York,6616,0,27
New York,6788,0,26
New York,5876,1,33
New York,5382,0,40
New York,5688,0,34
New York,6667,1,20
New York,5929,1,28
New York,6096,0,30
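For each city, I want to calculate the median Sales for each run of consecutive zeros before and after a flag of "1".
I ran the following code, which was suggested in the comments: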
setDT(c)[, .(median(Sales), median(Count)), .(City, rleid(flag))][rleid %% 2 == 1, .(City, median = V1, count = V2)]
This is the output I got:
head(d,20)
City median count
1: Paris 6738.000 15.00000
2: Paris 5527.000 28.00000
3: Paris 6757.000 20.00000
4: Paris 5379.000 36.00000
5: Paris 6686.000 25.00000
6: NY 6648.429 23.57143
7: NY 5535.000 37.00000
8: NY 6096.000 30.00000
The expected output is given below. The discrepancy shows up in the New York group (median of Sales and Count).
R code output: row 6, NY: 6648.429 and Count: 23.57
Excel output: NY: 6616 and Count: 26
Thanks, Jie
Answer 0 (score: 3)
You can use rleid from data.table to calculate the average for each City and rle group (runs of 0s and 1s), and then select the groups where the flag is 0.
library(data.table)
setDT(data)[, .(mean(Sales), mean(Count)), .(City, rleid(flag))][rleid %% 2 == 1, .(City, average = V1, count = V2)]
City average
1: Paris 4000.000
2: Paris 3833.333
3: NY 4500.000
4: NY 3500.000
data[, rleid(flag)]
The output is:
[1] 1 1 1 2 3 3 3 4 5 5 6 7 7 8
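Since the question asks for medians, and a city will not always start with a run of zeros, a variant of the call above can name the run column explicitly and filter on the flag value instead of the parity of the run id. This is only a minimal sketch under those assumptions, not the answerer's code: the column names run, median_sales and median_count are illustrative, and dt holds just the New York rows from the question with the Group column called City.

```r
library(data.table)

# New York rows from the question (Group column named City here, as in the code above).
dt <- data.table(
  City  = "New York",
  Sales = c(5333, 6763, 6468, 6923, 6705, 6542, 6497, 6616, 6788,
            5876, 5382, 5688, 6667, 5929, 6096),
  flag  = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0),
  Count = c(19, 37, 32, 34, 16, 11, 19, 27, 26, 33, 40, 34, 20, 28, 30)
)

# One group per City and per run of identical flag values,
# then keep only the runs of zeros.
res <- dt[, .(median_sales = median(Sales), median_count = median(Count)),
          by = .(City, flag, run = rleid(City, flag))][flag == 0]
print(res)
```

For the first run of zeros in New York this gives 6616 and 26, which matches the Excel figures in the question; 6648.429 and 23.57 are the means of the same seven rows.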
Answer 1 (score: 3)
x <- read.csv(header=TRUE, stringsAsFactors=FALSE, text='
City, Sales, flag
Paris, 3000, 0
Paris, 4000, 0
Paris, 5000, 0
Paris, 3000, 1
Paris, 3000, 0
Paris, 4000, 0
Paris, 4500, 0
NY, 3000, 1
NY, 4000, 0
NY, 5000, 0
NY, 3000, 1
NY, 3000, 0
NY, 4000, 0
NY, 4500, 1')
do.call(rbind,
        by(x, list(x$City, cumsum(c(0, diff(x$flag) != 0))),
           function(a) { a$Sales <- mean(a$Sales); a[1, , drop = FALSE] }))
# City Sales flag
# 1 Paris 4000.000 0
# 4 Paris 3000.000 1
# 5 Paris 3833.333 0
# 8 NY 3000.000 1
# 9 NY 4500.000 0
# 11 NY 3000.000 1
# 12 NY 3500.000 0
# 14 NY 4500.000 1
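If medians are wanted and only the runs of zeros should be kept, the same grouping idea can be written with base R alone. The following is a rough sketch rather than the answer's own code: it rebuilds a run id per city with rle() inside ave(), and the names x2, run, zeros and out are illustrative.

```r
# Sample data from this answer, rebuilt so the block runs on its own.
x2 <- data.frame(
  City  = c(rep("Paris", 7), rep("NY", 7)),
  Sales = c(3000, 4000, 5000, 3000, 3000, 4000, 4500,
            3000, 4000, 5000, 3000, 3000, 4000, 4500),
  flag  = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Run id within each city: rle() gives the run lengths,
# rep() expands them back to one id per row.
x2$run <- ave(x2$flag, x2$City, FUN = function(f) {
  r <- rle(f)
  rep(seq_along(r$lengths), r$lengths)
})

# Median Sales per city and run, keeping only rows where flag is 0.
zeros <- x2[x2$flag == 0, ]
out <- aggregate(Sales ~ City + run, data = zeros, FUN = median)
out <- out[order(out$City, out$run), ]
print(out)
```

On this sample the second Paris run of zeros has a median of 4000, whereas its mean is 3833.333, so the choice of summary function does change the result.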
Using dplyr:
library(dplyr)
x %>%
  mutate(flaggroup = cumsum(c(0, diff(flag) != 0))) %>%
  group_by(City, flaggroup) %>%
  summarize(Sales = mean(Sales), flag = first(flag)) %>%
  ungroup() %>%
  select(-flaggroup)
# # A tibble: 8 × 3
# City Sales flag
# <chr> <dbl> <int>
# 1 NY 3000.000 1
# 2 NY 4500.000 0
# 3 NY 3000.000 1
# 4 NY 3500.000 0
# 5 NY 4500.000 1
# 6 Paris 4000.000 0
# 7 Paris 3000.000 1
# 8 Paris 3833.333 0
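For the median of the zero runs only, the pipeline above can be adapted as sketched here. It assumes dplyr 1.1.0 or later, where consecutive_id() is available as a replacement for the cumsum(diff()) idiom; the names x3, run and res3 are illustrative and not part of the original answer.

```r
library(dplyr)

# Sample data from this answer, rebuilt so the block runs on its own.
x3 <- data.frame(
  City  = c(rep("Paris", 7), rep("NY", 7)),
  Sales = c(3000, 4000, 5000, 3000, 3000, 4000, 4500,
            3000, 4000, 5000, 3000, 3000, 4000, 4500),
  flag  = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

res3 <- x3 %>%
  # A new run starts whenever City or flag changes from the previous row.
  group_by(City, flag, run = consecutive_id(City, flag)) %>%
  summarize(Sales = median(Sales), .groups = "drop") %>%
  # Keep only the runs of zeros, in their original order.
  filter(flag == 0) %>%
  arrange(run) %>%
  select(City, Sales)
print(res3)
```

Filtering on flag == 0 rather than on the run number keeps the result correct even when a city starts with a flag of 1, as NY does here.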