How to find a percentile and then group by in R

Time: 2017-02-15 06:47:56

Tags: r dataframe dplyr percentile

I have a data frame (df) like the one below.

day  area   hour  time  count
___  ____  _____  ___   ____
 1    1      0     1     10
 1    1      0     2     12
 1    1      0     3     8
 1    1      0     4     12    
 1    1      0     5     15  
 1    1      0     6     18 
 1    1      1     1     10
 1    1      1     2     12
 1    1      1     3     8
 1    1      1     4     12    
 1    1      1     5     15  
 1    1      1     6     18
 1    1      1     7     12    
 1    1      1     8     15  
 1    1      1     9     18
 1    1      2     1     10    
 1    1      2     2     18  
 1    1      2     3     19
 .....
 2    1      0     1     18
 2    1      0     2     12
 2    1      0     3     18
 2    1      0     4     12    
 2    1      1     1     8
 2    1      1     2     12
 2    1      1     3     18
 2    1      1     4     10    
 2    1      1     5     15  
 2    1      1     6     18
 2    1      1     7     12    
 2    1      1     8     15  
 2    1      1     9     18
 2    1      2     1     10    
 2    1      2     2     18  
 2    1      2     3     19
 2    1      2     4     9    
 2    1      2     5     18  
 2    1      2     6     9


..... 
 30    99      23     1     9    
 30    99      23     2     8  
 30    99      23     3     9
 30    99      23     4     19    
 30    99      23     5     18  
 30    99      23     6     9
 30    99      23     7     19    
 30    99      23     8     8  
 30    99      23     9     19

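Here I have data for 87 areas (1 to 82, then 90, 93, 95, 97, 99) and 24 hours (0 to 23) per day, over 30 days. The data records the time taken to cross an area and how many vehicles crossed in that time.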

For example:

day  area   hour  time  count
___  ____  _____  ___   ____
 1    1      0     1     10
 1    1      0     2     12
 1    1      0     3     8
 1    1      0     4     12    
 1    1      0     5     15  
 1    1      0     6     18 

This gives me, for day 1 and hour 0, the time taken to cross area 1:

time  count   cumulative_count
___    ___    ________________
 1     10           10
 2     12           22
 3     8            30
 4     12           42    
 5     15           57
 6     18           75 
10 vehicles crossed the area in 1 minute.
12 vehicles crossed the area in 2 minutes.
8 vehicles crossed the area in 3 minutes.
12 vehicles crossed the area in 4 minutes.
15 vehicles crossed the area in 5 minutes.
18 vehicles crossed the area in 6 minutes.

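From this I want to calculate the time taken for 80% of the vehicles to cross area 1 in hour 0. The total number of vehicles is (10 + 12 + 8 + 12 + 15 + 18) = 75, and 80% of 75 is 60. So the time taken for 80% of the vehicles (60 of 75) to cross area 1 on day 1 in hour 0 will be between 5 and 6 (closer to 5). The result would look like: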

    day  area   hour    time_taken_for_80%vehicles_to_pass
    ___  ____   ____    ___________________________________
     1    1      0                5.33(approximately)
     1    1      1                7.30
     1    1      2                2.16
    ....
     30   1      23               3.13
     1    2      0                ---
     1    2      1                ---
     1    2      2                ---
     1    2      3                ---

 .......

     30    99     21              ---
     30    99     22              ---
     30    99     23              ---

I know I have to take the quantile and then group by area, day and hour, so I tried:

library(dplyr)
grp <- group_by(df, day,area,hour,quantile(df$count,0.8))

But it doesn't work. Any help is appreciated.

1 answer:

Answer 0: (score: 1)

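My solution calculates, for each value of time, the cumulative percentage of vehicles that have crossed the area. It then takes the first time at which that percentage exceeds 80%: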

str <- 'day  area   hour  time  count
1    1      0     1     10
1    1      0     2     12
1    1      0     3     8
1    1      0     4     12    
1    1      0     5     15  
1    1      0     6     18
1    1      1     1     10
1    1      1     2     12
1    1      1     3     8
1    1      1     4     12    
1    1      1     5     15  
1    1      1     6     18
1    1      1     7     12    
1    1      1     8     15  
1    1      1     9     18
1    1      2     1     10    
1    1      2     2     18  
1    1      2     3     19'



file <- textConnection(str)
df <- read.table(file, header = TRUE)

df

library(dplyr)
df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter(p > 0.8) %>%
  summarise(time = min(time))

Result:

    day  area  hour  time
  <int> <int> <int> <int>
1     1     1     0     6
2     1     1     1     8
3     1     1     2     3
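
To make the steps inside the pipeline concrete, here is a minimal base R sketch for a single group (day 1, area 1, hour 0), using the counts from the sample data. The variable names `p` and `first_time` are only illustrative, and it assumes the rows are already ordered by time.

```r
# Counts for day 1, area 1, hour 0 (from the sample data)
time  <- 1:6
count <- c(10, 12, 8, 12, 15, 18)

cumcount <- cumsum(count)       # running total of vehicles: 10 22 30 42 57 75
p <- cumcount / max(cumcount)   # cumulative share of the hour's vehicles

# First time at which strictly more than 80% of the vehicles have crossed
first_time <- time[which(p > 0.8)[1]]
first_time
```

At time 5 the cumulative share is 57/75 = 0.76, so the first time above 80% is 6, which matches the first row of the result.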

Or, as a linear estimate of the time at which 80% is reached:

df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount),
         g = +(p > 0.8),
         order = (g*2-1)*time) %>%
  group_by(day, area, hour,g) %>%
  filter(row_number((g*2-1)*time)==1) %>%
  group_by(day, area, hour) %>%
  summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))

Result:

    day  area  hour     time
  <int> <int> <int>    <dbl>
1     1     1     0 5.166667
2     1     1     1 7.600000
3     1     1     2 2.505263
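
The same linear estimate can probably be written more compactly with base R's `approx()`, which interpolates linearly between known points. This is only a sketch and not part of the answer above: the helper name `time_at_fraction` is hypothetical, it assumes the cumulative share is strictly increasing within each group (no zero counts), and it returns `NA` when the first time already exceeds the requested fraction. The `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper: interpolate the time at which the cumulative share
# of vehicles reaches `frac`.
time_at_fraction <- function(time, count, frac = 0.8) {
  o <- order(time)                      # make sure counts accumulate in time order
  p <- cumsum(count[o]) / sum(count)    # cumulative share per time step
  approx(x = p, y = time[o], xout = frac)$y
}

# The sample data from the answer, rebuilt as a data frame
df <- data.frame(
  day = 1, area = 1,
  hour = rep(0:2, times = c(6, 9, 3)),
  time = c(1:6, 1:9, 1:3),
  count = c(10, 12, 8, 12, 15, 18,
            10, 12, 8, 12, 15, 18, 12, 15, 18,
            10, 18, 19)
)

res <- df %>%
  group_by(day, area, hour) %>%
  summarise(time = time_at_fraction(time, count), .groups = "drop")
res
```

For hour 0 this interpolates between (0.76, 5) and (1, 6), giving 5 + 0.04/0.24, i.e. about 5.17, in line with the result above.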

Or get the same result using lag and lead:

df %>% group_by(day, area, hour) %>%
  arrange(time) %>%  # sort by time so that cumsum() accumulates in time order
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter((p >= 0.8&lag(p)<0.8)|(p < 0.8&lead(p)>=0.8)) %>%
  summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))
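
To see which rows the `lag`/`lead` filter keeps, here is a small sketch on the hour 2 counts alone (10, 18, 19). The vector name `keep` is only illustrative. Note that `lag()` and `lead()` return `NA` at the group edges, so a group whose very first row is already at or above 80% would presumably lose that row in the filter; that case does not occur in the sample data.

```r
library(dplyr)

count <- c(10, 18, 19)            # day 1, area 1, hour 2
p <- cumsum(count) / sum(count)   # cumulative share at times 1, 2, 3

dplyr::lag(p)    # previous share; NA for the first row
dplyr::lead(p)   # next share; NA for the last row

# Keep the last row below 80% and the first row at or above 80%
keep <- (p >= 0.8 & dplyr::lag(p) < 0.8) | (p < 0.8 & dplyr::lead(p) >= 0.8)
which(keep)
```

Only the rows for times 2 and 3 survive, and these are the two points the final `summarise()` interpolates between.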