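Find n% of the records within a variable in a data frame

Time: 2019-02-16 11:29:39

Tags: r datatable dplyr

I have data stored in a data frame where the first column is a date and the second column is an individual weight. Here is a sample of the data: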

df <- data.frame(
  date = c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01",
           "2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01",
           "2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-02",
           "2019-01-02", "2019-01-02", "2019-01-02", "2019-01-02",
           "2019-01-02", "2019-01-02", "2019-01-02"),
  weight = c(2174.8, 2174.8, 2174.8, 8896.53, 8896.53, 2133.51, 2133.51,
             2892.32, 2892.32, 2892.32, 2892.32, 5287.78, 5287.78, 6674.03,
             6674.03, 6674.03, 6674.03, 6674.03, 5535.11, 5535.11)
)

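I want to first run simple summary statistics for each date, and then count the records whose weight falls within given ranges, where each category is defined as a percentage of the total weight range. Finally, the record count for each category should be stored in its own column: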

Lowest 10%
10-20%
20-40%
40-60%
60-80%
80-90%
90-100%

The cutoff for a given X% is computed as: MinWeight + (MaxWeight - MinWeight) * X%
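
To make the formula concrete, here is a minimal base-R sketch for a single date, using the ten 2019-01-01 weights from the sample data. The names `w`, `pct`, `frac`, `cutoffs` and `category` are illustrative and not part of the question. It bins the rescaled position `(weight - min) / (max - min)` rather than the raw weight, which should be equivalent to the formula above while keeping the minimum and maximum exactly on the outer boundaries.

```r
# Weights for 2019-01-01, copied from the question's sample data
w <- c(2174.8, 2174.8, 2174.8, 8896.53, 8896.53, 2133.51, 2133.51,
       2892.32, 2892.32, 2892.32)

# Category boundaries as fractions of the total weight range
pct <- c(0, 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 1)

# Cutoffs in weight units: MinWeight + (MaxWeight - MinWeight) * X%
cutoffs <- min(w) + (max(w) - min(w)) * pct

# Position of each record within the range, from 0 (min) to 1 (max)
frac <- (w - min(w)) / (max(w) - min(w))

labels <- c("Lowest 10%", "10-20%", "20-40%", "40-60%",
            "60-80%", "80-90%", "90-100%")

# include.lowest = TRUE keeps the minimum weight in the first category
category <- cut(frac, breaks = pct, labels = labels, include.lowest = TRUE)

# Number of records per category
table(category)
```

For this sample, 5 records land in "Lowest 10%", 3 in "10-20%" and 2 in "90-100%"; the other categories are empty.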

This is my expected result (I only show two of the percentage-range columns):

df %>% 
  group_by(date) %>%
  summarise(mean(weight), min(weight), max(weight))
   date       `mean(weight)` `min(weight)` `max(weight)` `Lowest 10%` `10-20%`
 2019-01-01          3726.         2134.         8897.    num records. num records.

2 answers:

Answer 0: (score: 2)

Check this solution:

library(tidyverse)
library(wrapr)

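df %>%
  group_by(date) %>%
  mutate(
    # unique row id so that spread() keeps one row per record
    rn = row_number(),
    temp = weight - min(weight),
    # position within the date's weight range, from 0 to 100
    temp = (temp / max(temp)) * 100,
    temp = cut(temp, seq(0, 100, 10), include.lowest = TRUE),
    # turn interval labels such as "(10,20]" into "10-20%"
    temp = str_remove(temp, '\\(|\\[') %>%
      str_replace(',', '-') %>%
      str_replace('\\]', '%'),
    one = 1
  ) %>%
  # one 0/1 indicator column per occupied bin
  spread(temp, one, fill = 0) %.>%
  left_join(
    summarise(.,
      `mean(weight)` = mean(weight),
      `min(weight)` = min(weight),
      `max(weight)` = max(weight)
    ),
    # sum the indicator columns to get record counts per bin
    summarise_at(., vars(matches('\\d+-\\d+.')), sum),
    by = "date"
  )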

Output:

   date       `mean(weight)` `min(weight)` `max(weight)` `0-10%` `10-20%` `60-70%` `90-100%`
  <fct>               <dbl>         <dbl>         <dbl>   <dbl>    <dbl>    <dbl>     <dbl>
1 2019-01-01          3726.         2134.         8897.       5        3        0         2
2 2019-01-02          5791.         2892.         6674.       1        0        4         5
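
As a cross-check on the counts above, the same ten equal-width bins can be reproduced with base R only. This is a minimal sketch, not the answer's method: it swaps the dplyr pipeline for `ave()`, `cut()` and `table()`, and the names `pct`, `bins` and `counts` are illustrative.

```r
# Sample data from the question
df <- data.frame(
  date = rep(c("2019-01-01", "2019-01-02"), each = 10),
  weight = c(2174.8, 2174.8, 2174.8, 8896.53, 8896.53, 2133.51, 2133.51,
             2892.32, 2892.32, 2892.32, 2892.32, 5287.78, 5287.78, 6674.03,
             6674.03, 6674.03, 6674.03, 6674.03, 5535.11, 5535.11)
)

# Rescale weight to 0-100 within each date
pct <- ave(df$weight, df$date,
           FUN = function(w) (w - min(w)) / (max(w) - min(w)) * 100)

# Ten equal-width bins; include.lowest = TRUE keeps each date's minimum in [0,10]
bins <- cut(pct, breaks = seq(0, 100, 10), include.lowest = TRUE)

# Rows are dates, columns are bins (empty bins are kept as zero counts)
counts <- table(df$date, bins)
counts
```

Unlike the `spread()` output, `table()` keeps every bin as a column, so empty ranges show up as zeros instead of being dropped.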

Answer 1: (score: 2)

This can be done as follows:

library(tidyverse)

df %>%
  group_by(date) %>%
  mutate(
    # Rescale weight to 0-100 within each date, then cut into ten equal-width bins
    wrange = cut((weight - min(weight)) / max(weight - min(weight)) * 100, 10,
                 labels = paste(
                   seq(0, 90, by = 10),
                   paste0(seq(10, 100, by = 10), "%"),
                   sep = '-')
                 )
    ) %>%
  # Braces stop %>% from also passing the data as an extra first argument
  {
    left_join(
      # list() replaces the deprecated funs()
      x = summarise_at(., vars(weight), list(mean = mean, min = min, max = max)),
      y = count(., wrange) %>% complete(wrange, fill = list(n = 0)) %>% spread(wrange, n),
      by = 'date'
    )
  } %>%
  rename_at(vars(matches("mean|min|max")), ~ paste0(., "(weight)"))

Which outputs:

#            date     mean(weight) min(weight) max(weight)  0-10%   10-20%  20-30%   30-40%  40-50%
#    1 2019-01-01     3726.144     2133.51     8896.53      5       3       0       0       0
#    2 2019-01-02     5790.825     2892.32     6674.03      1       0       0       0       0
#           50-60%  60-70%  70-80%  80-90%   90-100%
#           0       0       0       0        2
#           0       4       0       0        5

(I reformatted the output so that all of the data is visible.)