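I want to use Hmisc::wtd.quantile to build a new data frame from a data frame that has many repeated dates. I am grouping by date, collapsing each date with summarize(), and trying to call wtd.quantile() (with weights) on each date. This is also a fairly large dataset. Here is some sample code: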
# sample data
# grouping_var = dt_time
require(Hmisc)
require(plyr)
require(dplyr)
df <- data.frame(type = sample(letters[1:2], 10e6, replace = TRUE),
                 score = sample(500:899, 10e6, replace = TRUE),
                 dt_time = sample(seq(as.Date('2010/01/01'),
                                      as.Date('2018/01/01'),
                                      by="day"), 10e6, replace = TRUE),
                 weight = sample(1.0:2.0, 10e6, replace = TRUE))
# my attempt:
ptiles <- df %>%
  group_by(dt_time) %>%
  plyr::ddply(~dt_time, dplyr::summarize,
              ptile10 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .1, na.rm = TRUE),
              ptile50 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .5, na.rm = TRUE),
              ptile90 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .9, na.rm = TRUE))
# desired df,
# where each new variable would be created using the
# wtd.quantile function:
desired_ptiles <- data.frame(dt_time = seq(as.Date('2010/01/01'),
                                           as.Date('2010/01/06'),
                                           by = "day"),
                             # only 6 because lol 10e6
                             ptile10 = sample(500:899, 6, replace = TRUE),
                             ptile50 = sample(500:899, 6, replace = TRUE),
                             ptile90 = sample(500:899, 6, replace = TRUE))
So far, my efforts have resulted in this error:
Error in summarise_impl(.data, dots) :
Evaluation error: 'arg' must be NULL or a character vector.
When using the formula notation:
ptiles <- df %>%
  group_by(dt_time) %>%
  plyr::ddply(~dt_time, dplyr::summarize,
              ptile10 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .1, na.rm = TRUE),
              ptile50 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .5, na.rm = TRUE),
              ptile90 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .9, na.rm = TRUE))
# error message:
Error in summarise_impl(.data, dots) :
Evaluation error: 'arg' must be NULL or a character vector.
Am I approaching this the wrong way? I have seen approaches that use split(), but that seems tedious. Is there a data.table way to have wtd.quantile() summarize like this?
Thanks!
Answer 0 (score: 2)
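You don't need group_by and ddply together, because either one already splits the data by group. Also, once the data is grouped you don't need to pass the data (. or .$score) inside summarize; refer to the columns by name.
This works: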
ptiles <- df %>%
  group_by(dt_time) %>%
  summarize(ptile10 = wtd.quantile(score, weights = weight,
                                   probs = .1, na.rm = TRUE),
            ptile50 = wtd.quantile(score, weights = weight,
                                   probs = .5, na.rm = TRUE),
            ptile90 = wtd.quantile(score, weights = weight,
                                   probs = .9, na.rm = TRUE))
> ptiles
# A tibble: 2,923 x 4
dt_time ptile10 ptile50 ptile90
<date> <dbl> <dbl> <dbl>
1 2010-01-01 539.0 697 859.0
2 2010-01-02 538.0 704 861.7
3 2010-01-03 541.0 706 862.0
4 2010-01-04 541.0 702 859.0
5 2010-01-05 540.0 706 860.0
6 2010-01-06 537.0 695 859.0
7 2010-01-07 539.0 696 859.0
8 2010-01-08 536.0 700 857.0
9 2010-01-09 538.0 694 861.0
10 2010-01-10 538.4 701 859.0
# ... with 2,913 more rows
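The error message in the question is consistent with R's argument matching: in the original call the whole data frame is passed as x, weights and probs are matched by name, and the unnamed .$score then falls into the next free formal. Assuming wtd.quantile's formals are ordered x, weights, probs, type, ... (as listed in the Hmisc documentation), that formal is type, which is resolved with match.arg(). The base R sketch below reproduces the same failure with a hypothetical stand-in function, toy_quantile, which only mimics that argument layout and is not the Hmisc implementation:

```r
# Hypothetical stand-in with the assumed argument order of wtd.quantile
# (x, weights, probs, type, na.rm). It ignores the weights on purpose;
# only the argument matching is of interest here.
toy_quantile <- function(x, weights = NULL, probs = c(0, .25, .5, .75, 1),
                         type = c("quantile", "i/n"), na.rm = TRUE) {
  type <- match.arg(type)  # errors if `type` is not NULL or character
  quantile(x, probs = probs, na.rm = na.rm, names = FALSE)
}

d <- data.frame(score = c(500, 600, 700), weight = c(1, 2, 1))

# Same call shape as in the question: data first, then an unnamed column.
# x = d; weights and probs match by name; d$score ends up in `type`.
msg <- tryCatch(
  toy_quantile(d, d$score, weights = d$weight, probs = .1),
  error = function(e) conditionMessage(e)
)
print(msg)

# Passing the column itself as x avoids the problem.
ok <- toy_quantile(d$score, weights = d$weight, probs = .5)
print(ok)
```

This is why dropping the leading . and .$ and passing score as the first argument makes the error go away.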
Answer 1 (score: 0)
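Here is an alternative solution that uses a function (based on @LAP's solution) to iterate over the set of quantiles you supply (i.e. c(.1, .5, .9)) and automatically create the corresponding columns: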
require(Hmisc)
require(plyr)
require(tidyverse)
# for reproducibility
set.seed(5)
# example dataset
df <- data.frame(type = sample(letters[1:2], 50000, replace = TRUE),
                 score = sample(500:899, 50000, replace = TRUE),
                 dt_time = sample(seq(as.Date('2010/01/01'),
                                      as.Date('2018/01/01'),
                                      by="day"), 50000, replace = TRUE),
                 weight = sample(1.0:2.0, 50000, replace = TRUE))
# function to get quantiles per date
f = function(x) {
  df %>%
    group_by(dt_time) %>%
    dplyr::summarize(value = wtd.quantile(score, weights = weight, probs = x, na.rm = TRUE)) %>%
    mutate(ptile = x) }
# input quantiles of interest
q = c(.1, .5, .9)
# apply function to each quantile of interest
# and reshape
map_df(q, f) %>%
spread(ptile, value, sep = "_")
# # A tibble: 2,923 x 4
# dt_time ptile_0.1 ptile_0.5 ptile_0.9
# <date> <dbl> <dbl> <dbl>
# 1 2010-01-01 543 670. 857
# 2 2010-01-02 549 800 868.
# 3 2010-01-03 529. 705 829.
# 4 2010-01-04 543 690 850.
# 5 2010-01-05 528. 642. 870.
# 6 2010-01-06 534. 649 810.
# 7 2010-01-07 512. 668 801.
# 8 2010-01-08 554 664 869.
# 9 2010-01-09 533. 650 765
#10 2010-01-10 535. 696 843.
# # ... with 2,913 more rows
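For the data.table route asked about in the question, the same idea can be sketched as follows. This is a minimal sketch, not a benchmarked solution: it assumes the data.table package is installed, uses a small made-up sample rather than the 10e6-row data, and the names dt, col_names and ptiles_dt are illustrative. It calls wtd.quantile once per date with all three probabilities and names the result columns explicitly instead of relying on the names wtd.quantile returns.

```r
library(Hmisc)
library(data.table)

# Small illustrative sample (assumption: same columns as the question's df)
set.seed(5)
n <- 5000
dt <- data.table(
  score   = sample(500:899, n, replace = TRUE),
  dt_time = sample(seq(as.Date("2010-01-01"), as.Date("2010-01-10"), by = "day"),
                   n, replace = TRUE),
  weight  = sample(1:2, n, replace = TRUE)
)

q <- c(.1, .5, .9)
col_names <- c("ptile10", "ptile50", "ptile90")  # one name per entry of q

# One wtd.quantile call per date; as.list() turns the three quantiles
# into three columns, and setNames() gives them predictable names.
ptiles_dt <- dt[, setNames(as.list(wtd.quantile(score, weights = weight,
                                                probs = q, na.rm = TRUE)),
                           col_names),
                by = dt_time][order(dt_time)]

print(head(ptiles_dt))
```

Computing all probabilities in a single call per group avoids scanning the data once per quantile, which may matter on a dataset of the size described in the question.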