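I have a dataset with 30 variables and ~6000 rows. The variable "chart number" identifies each subject, and each subject is measured repeatedly over time. There are therefore about 500 unique chart numbers, but because subjects have multiple visits, we have about 6000 observations.
Now I need to fill in each subject's missing height data using that subject's mean height. How do I apply a function to each level of chart number, i.e. to each subject?
Right now I am creating a list, storing each subject's visits in the list as its own data frame, and then running a loop over all elements (data frames) of the list.
How can I apply a function to each level of chart number without creating a list?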
Answer 0: (score: 0)
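Consider base R's ave, a function that applies a method across the levels of a factor and returns a vector of the same length as its input. Specifically, you can wrap ave() in ifelse to keep the non-missing heights and replace only the missing height values: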
df$imp_visit_height <- ifelse(is.na(df$visit_height),
                              ave(df$visit_height, df$chart_number,
                                  FUN=function(x) mean(x, na.rm=TRUE)),
                              df$visit_height)
Or use within():
df <- within(df, imp_visit_height <- ifelse(is.na(visit_height),
                                            ave(visit_height, chart_number,
                                                FUN=function(x) mean(x, na.rm=TRUE)),
                                            visit_height))
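To make the behaviour of ave() concrete, here is a minimal sketch on a made-up toy vector. The values and the names heights, ids, group_means and filled are assumptions for illustration only, not taken from the question's data. It also shows an edge case that may be worth checking: a subject whose heights are all missing gets NaN as its group mean.

```r
# Toy data (hypothetical): three subjects "a", "b", "c"
heights <- c(150, NA, 154, 160, NA, NA)
ids     <- c("a", "a", "a", "b", "b", "c")

# ave() returns one value per input row: the mean of that row's group
group_means <- ave(heights, ids, FUN = function(x) mean(x, na.rm = TRUE))
group_means
# 152 152 152 160 160 NaN

# Keep observed heights, use the group mean only where the height is missing
filled <- ifelse(is.na(heights), group_means, heights)

# Subject "c" has no observed height, so its mean is NaN;
# turning it back into NA keeps "still missing" explicit
filled[is.nan(filled)] <- NA
filled
# 150 152 154 160 160 NA
```

Whether such subjects should stay missing or fall back to some overall mean is a modelling decision that the question does not settle.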
Demonstration with random data, where chart_number takes the names of computing languages/packages:
set.seed(43018) # SEEDED FOR REPRODUCIBILITY
grp <- c("julia", "r", "pandas", "sas", "stata", "spss")
df <- data.frame(
  chart_number = replicate(100, sample(grp, 1, replace=TRUE)),
  time = as.Date(replicate(100, Sys.Date() - sample(1:120, 1, replace=TRUE)),
                 origin="1970-01-01"),
  visit_height = rnorm(100, mean=50, sd=5),
  measurement = rnorm(100)*100
)
# RANDOMLY ASSIGN 25 ROWS WITH NA TO visit_height
df[sample(1:100, 25, replace=TRUE), c("visit_height")] <- NA
# CONDITIONALLY IMPUTE MISSING VALUES
df$imp_visit_height <- ifelse(is.na(df$visit_height),
                              ave(df$visit_height, df$chart_number,
                                  FUN=function(x) mean(x, na.rm=TRUE)),
                              df$visit_height)
Output (filtered to rows with missing visit_height)
df[is.na(df$visit_height),]
#    chart_number       time visit_height measurement imp_visit_height
# 4           sas 2018-02-03           NA -116.072314         49.77708
# 6          spss 2018-04-02           NA   33.049215         52.05987
# 12        julia 2018-01-14           NA  135.954163         52.49936
# 14       pandas 2018-04-09           NA  -92.215212         49.23258
# 19         spss 2018-01-21           NA  -43.422507         52.05987
# 27        julia 2018-03-18           NA  -46.679790         52.49936
# 45       pandas 2018-03-19           NA -181.014747         49.23258
# 48        stata 2018-02-22           NA  -89.135797         51.12526
# 51         spss 2018-01-24           NA    9.784664         52.05987
# 53       pandas 2018-04-23           NA  106.461095         49.23258
# 55       pandas 2018-02-17           NA  121.749821         49.23258
# 58        julia 2018-01-19           NA -151.584425         52.49936
# 65       pandas 2018-03-04           NA -148.877957         49.23258
# 70            r 2018-01-05           NA   83.888427         49.29048
# 71          sas 2018-02-21           NA -213.640525         49.77708
# 73        julia 2018-04-18           NA  181.791644         52.49936
# 79            r 2018-03-09           NA   -4.446414         49.29048
# 82       pandas 2018-02-20           NA   28.069077         49.23258
# 84        julia 2018-02-27           NA   16.468641         52.49936
# 85         spss 2017-12-31           NA -106.316136         52.05987
# 86            r 2018-02-26           NA    1.450771         49.29048
# 91         spss 2018-04-05           NA  -34.662075         52.05987
# 93            r 2018-03-03           NA   36.777125         49.29048
# 95        julia 2018-01-20           NA  -36.827340         52.49936
# 98        julia 2017-12-31           NA  125.342483         52.49936
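In the output above, every imputed value within a group is identical, which is what a per-group mean should produce. One way to cross-check this independently of ave() is sketched below on a small made-up data frame. The values and the names grp_means, was_na and expected are assumptions for illustration, and tapply() is used here only as a second route to the same group means.

```r
# Toy data (hypothetical values, not from the demonstration above)
df <- data.frame(
  chart_number = c("r", "r", "r", "sas", "sas", "spss"),
  visit_height = c(48, NA, 52, 50, NA, 51)
)

# Same conditional imputation as in the answer
df$imp_visit_height <- ifelse(is.na(df$visit_height),
                              ave(df$visit_height, df$chart_number,
                                  FUN = function(x) mean(x, na.rm = TRUE)),
                              df$visit_height)

# Group means computed a second way, without ave()
grp_means <- tapply(df$visit_height, df$chart_number, mean, na.rm = TRUE)

# Rows whose height was originally missing
was_na <- is.na(df$visit_height)

# Each imputed value should equal the mean of its own group
expected <- as.vector(grp_means[as.character(df$chart_number[was_na])])
stopifnot(isTRUE(all.equal(df$imp_visit_height[was_na], expected)))

# Observed heights should pass through unchanged
stopifnot(identical(df$imp_visit_height[!was_na], df$visit_height[!was_na]))
```

If both stopifnot() calls pass silently, the imputed column agrees with the independently computed group means.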