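I have sampling weights, and I am trying to compute the quantile (0 to 100) of each observation of a variable, call it 'value', within its group, and then assign each observation its quantile in a new variable.
In other words, each row is an observation, and each observation belongs to exactly one group. Every group has more than 2 observations. Within each group, I need to use the sampling weights in my data to estimate the distribution of value, determine the percentile of each observation within its group's distribution, and then add that percentile to the data frame as a column.
As far as I know, the survey package has svyby() and svyquantile(), but the latter returns the value at a specified quantile, not the quantile of a given observation.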
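To make the distinction concrete, here is an unweighted base R analogue. It is only an illustration: the vector x is made up, and quantile() and ecdf() ignore sampling weights, so this is not a solution to the weighted problem.

```r
# Made-up example values (an assumption for illustration only)
x <- c(10, 20, 30, 40)

# Quantile function: probability in, value out
value_at_median <- unname(quantile(x, probs = 0.5))

# Empirical CDF: value in, share of observations at or below it out.
# This is the direction asked about, but without weights.
percentile_of_each <- 100 * ecdf(x)(x)

print(value_at_median)
print(percentile_of_each)
```

What I am after is the second mapping, estimated with the sampling weights and computed separately within each group.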
# Load survey package
library(survey)
# Set seed for replication
set.seed(123)
# Create data with value, group, weight
dat <- data.frame(value = 1:6,
                  group = rep(1:3, 2),
                  weight = abs(rnorm(6)))
# Declare survey design (weights are passed as a one-sided formula)
d <- survey::svydesign(ids = ~1, data = dat, weights = ~weight)
# Do something to calculate the quantile and add it to the data
????
This is similar to the following question, except that that one is not done by subgroup: Compute quantiles incorporating Sample Design (Survey package)
Answer 0 (score: 0)
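I put together a solution. The sequence of statements inside mutate() below can be modified to convert the sampling weights into whatever quantile is of interest. Although this could be done in base R, I use the dplyr package because dplyr::bind_rows() fills columns that are missing from one data frame with NA when it combines two data frames.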
# Load dplyr for the pipe (%>%) and the data-frame verbs
library(dplyr)
# Set seed for replication
set.seed(123)
# Create data with value, group, weight
dat <- data.frame(value = 1:6,
                  group = rep(1:3, 2),
                  weight = abs(rnorm(6)))
# Initialize list for storing group results
# Setting the length of the list is quicker than
# creating an empty list and growing it
quantile_list <- vector("list", length(unique(dat$group)))
# Initialize variable to indicate initial iteration
iteration <- 0
# estimate the decile of each respondent
# in a large for-loop
for (g in unique(dat$group)) {
  # Keep only observations for a given group. The loop variable is named g
  # because filter(group == group) would compare the column with itself
  # and keep every row.
  temp <- dat %>% dplyr::filter(group == g)
  # Create subset with missing values
  temp_missing <- temp %>% dplyr::filter(is.na(value))
  # Create subset without missing values
  temp_nonmissing <- temp %>% dplyr::filter(!is.na(value))
  # Sort observations with value on value, calculate cumulative
  # sum of sampling weights, create variable indicating the decile
  # of responses. 1 = lowest, 10 = highest
  temp_nonmissing <- temp_nonmissing %>%
    dplyr::arrange(value) %>%
    dplyr::mutate(
      cumulative_weight = cumsum(weight),
      cumulative_weight_prop = cumulative_weight / sum(weight),
      decile = dplyr::case_when(
        cumulative_weight_prop < 0.10 ~ 1,
        cumulative_weight_prop >= 0.10 & cumulative_weight_prop < 0.20 ~ 2,
        cumulative_weight_prop >= 0.20 & cumulative_weight_prop < 0.30 ~ 3,
        cumulative_weight_prop >= 0.30 & cumulative_weight_prop < 0.40 ~ 4,
        cumulative_weight_prop >= 0.40 & cumulative_weight_prop < 0.50 ~ 5,
        cumulative_weight_prop >= 0.50 & cumulative_weight_prop < 0.60 ~ 6,
        cumulative_weight_prop >= 0.60 & cumulative_weight_prop < 0.70 ~ 7,
        cumulative_weight_prop >= 0.70 & cumulative_weight_prop < 0.80 ~ 8,
        cumulative_weight_prop >= 0.80 & cumulative_weight_prop < 0.90 ~ 9,
        cumulative_weight_prop >= 0.90 ~ 10
      )
    )
  # Increment the iteration of the for loop
  iteration <- iteration + 1
  # Join the data with missing values and the data without
  # missing values on the value variable into
  # a single data frame
  quantile_list[[iteration]] <- dplyr::bind_rows(temp_nonmissing, temp_missing)
}
# Convert the list of data frames into a single dataframe
out <- dplyr::bind_rows(quantile_list)
# Show outcome
head(out)
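
The same idea can also be sketched in base R as a percentile (0 to 100) rather than a decile. This is a minimal sketch under some assumptions, not a survey-package feature: weighted_percentile() is a hypothetical helper name, the percentile is defined as the share of the group's total weight at or below the observation's value, and the weights are fixed numbers chosen so the result is easy to check by hand. Unlike the cumsum() approach above, tied values receive the same percentile here.

```r
# Hypothetical helper (not from the survey package): share of total weight
# at or below each value, scaled to 0-100. Rows with NA stay NA.
weighted_percentile <- function(value, weight) {
  out <- rep(NA_real_, length(value))
  ok <- !is.na(value) & !is.na(weight)
  v <- value[ok]
  w <- weight[ok]
  # For each observation, add up the weights of all observations whose
  # value is less than or equal to its own
  cum_w <- vapply(v, function(x) sum(w[v <= x]), numeric(1))
  out[ok] <- 100 * cum_w / sum(w)
  out
}

# Example data with fixed, made-up weights (an assumption for illustration)
dat <- data.frame(value = 1:6,
                  group = rep(1:3, 2),
                  weight = c(1, 2, 3, 3, 2, 1))

# Apply the helper within each group, writing results back by row index
dat$percentile <- NA_real_
for (idx in split(seq_len(nrow(dat)), dat$group)) {
  dat$percentile[idx] <- weighted_percentile(dat$value[idx], dat$weight[idx])
}

# Collapse percentiles to deciles using the same cut points as the loop above
dat$decile <- pmin(floor(dat$percentile / 10) + 1, 10)

print(dat)
```

Because the rows are updated in place by index, the original row order is preserved and no final bind step is needed. The helper compares every pair of observations within a group, so for very large groups a sort-and-cumsum version would be faster.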