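My goal is to create a function that, when looped over multiple variables of a data frame, returns a new data frame containing the percentage and 95% confidence interval for each level of each variable.
For example, if I applied this function to "cyl" and "am" from the mtcars data frame, I would want this as the final result: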
  variable level                ci.95
1      cyl     4 34.38 (19.50, 53.11)
2      cyl     6 21.88 (10.35, 40.45)
3      cyl     8 43.75 (27.10, 61.94)
4       am     0  34.38 (40.94, 75.5)
5       am     1 40.62 (24.50, 59.06)
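My function seems to work for a single variable; however, I have two issues that I hope the community can help me with:
First, generally R-ifying my code. I'm still an R newbie. I've read enough posts to know that R enthusiasts generally discourage for loops, but I still struggle with the apply functions (which seem to be the alternative to for loops in most cases).
Second, applying this function to a list of variables, so that the result is a single data frame containing the function's return values for each level of each variable.
Here is my code so far: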
t1.props <- function(x, data = NULL) {
  # Grab dataframe and/or variable name
  if(!missing(data)){
    var <- data[,deparse(substitute(x))]
  } else {
    var <- x
  }
  # Grab variable name for use in output
  var.name <- substitute(x)
  # Omit observations with missing data
  var.clean <- na.omit(var)
  # Number of nonmissing observations
  n <- length(var.clean)
  # Grab levels of variable
  levels <- sort(unique(var.clean))
  # Create an empty data frame to store values
  out <- data.frame(variable = NA,
                    level = NA,
                    ci.95 = NA)
  # Estimate prop, se, and ci for each level of the variable
  for(i in seq_along(levels)) {
    prop <- paste0("prop", i)
    se <- paste0("se", i)
    log.prop <- paste0("log.trans", i)
    log.se <- paste0("log.se", i)
    log.l <- paste0("log.l", i)
    log.u <- paste0("log.u", i)
    lcl <- paste0("lcl", i)
    ucl <- paste0("ucl", i)
    # Find the proportion for each level of the variable
    assign(prop, sum(var.clean == levels[i]) / n)
    # Find the standard error for each level of the variable
    assign(se, sd(var.clean == levels[i]) /
             sqrt(length(var.clean == levels[i])))
    # Perform a logit transformation of the original percentage estimate
    assign(log.prop, log(get(prop)) - log(1 - get(prop)))
    # Transform the standard error of the percentage to a standard error of its
    # logit transformation
    assign(log.se, get(se) / (get(prop) * (1 - get(prop))))
    # Calculate the lower and upper confidence bounds of the logit
    # transformation
    assign(log.l,
           get(log.prop) -
             qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    assign(log.u,
           get(log.prop) +
             qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    # Finally, perform inverse logit transformations to get the confidence bounds
    assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
    assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))
    # Create a combined 95% CI variable for easy copy/paste into Word tables
    ci.95 <- paste0(round(get(prop) * 100, 2), " ",
                    "(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
                    round(get(ucl) * 100, 2), ")")
    # Populate the "out" data frame with values
    out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
  }
  # Remove first (empty) row from out
  # But only in the first iteration
  if (is.na(out[1,1])) {
    out <- out[-1, ]
    rownames(out) <- 1:nrow(out)
  }
  out
}
data(mtcars)
t1.props(cyl, mtcars)
Thanks for any help or advice you can offer.
Answer 0 (score: 0)
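The nice thing about the functions you are using is that they are already vectorized. The exception is sd, which summarizes a single vector; qt already accepts vectors for its p and df arguments, although Vectorize can be used to vectorize a function over specific arguments, as done below. This means you can pass vectors to them without writing a single loop. I have left out the parts of your function that prepare the input and format the output.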
t1.props <- function(var, data=mtcars) {
  N <- nrow(data)
  levels <- names(table(data[,var]))
  count <- unclass(table(data[,var]))         # counts
  prop <- count / N                           # proportions
  se <- sqrt(prop * (1-prop)/(N-1))           # standard errors of props.
  lprop <- log(prop) - log(1-prop)            # logit of prop
  lse <- se / (prop*(1-prop))                 # se on the logit scale
  stat <- Vectorize(qt, "df")(0.975, N-1)     # t quantile
  llower <- lprop - stat*lse                  # lower bound, logit scale
  lupper <- lprop + stat*lse                  # upper bound, logit scale
  lower <- exp(llower) / (1 + exp(llower))    # lower ci
  upper <- exp(lupper) / (1 + exp(lupper))    # upper ci
  data.frame(variable=var,
             level=levels,
             perc=100*prop,
             lower=100*lower,
             upper=100*upper)
}
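The closed-form standard error above is algebraically the same as the `sd(var.clean == levels[i]) / sqrt(n)` expression in the loop version, because the sample standard deviation of a 0/1 indicator with mean p is sqrt(n/(n-1) * p * (1-p)). A quick check on `mtcars$cyl`, as a sketch (the names `se_loop` and `se_vec` are illustrative, not from the original code):

```r
# Check that the closed-form standard error matches the sd()-based one
x <- mtcars$cyl
n <- length(x)
lev <- sort(unique(x))

# Loop-style: sd of the 0/1 indicator for each level, divided by sqrt(n)
se_loop <- sapply(lev, function(l) sd(x == l) / sqrt(n))

# Vectorized: closed form computed from the proportions
p <- as.vector(table(x)) / n
se_vec <- sqrt(p * (1 - p) / (n - 1))

# Compare the two with numeric tolerance
all.equal(se_loop, se_vec)
```

Note that the vectorized version uses `nrow(data)` for N, so the two only agree when the variable has no missing values; the loop version drops them with `na.omit` first.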
So the only explicit apply/loop occurs when you apply the function to multiple variables, as follows:
## Apply your function to two variables
do.call(rbind, lapply(c("cyl", "am"), t1.props))
# variable level perc lower upper
# 4 cyl 4 34.375 19.49961 53.11130
# 6 cyl 6 21.875 10.34883 40.44691
# 8 cyl 8 43.750 27.09672 61.94211
# 0 am 0 59.375 40.94225 75.49765
# 1 am 1 40.625 24.50235 59.05775
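As for the loop in your code, it does not matter much in terms of efficiency, but you can see how much more concise the code can be, and the apply functions offer many simple one-line solutions.
I think the most important thing to change in your code is the use of assign and get. Instead, you can store the variables in a list or another data structure and, where needed, name the components with setNames, names<-, or names(...) <-.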
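Since the output formatting was left out of the function above, here is one way the numeric columns could be collapsed back into the single `ci.95` string from the question. This is only a sketch: `format_ci` is a hypothetical helper, and the example rows are copied from the output shown above rather than recomputed.

```r
# Hypothetical helper: collapse perc/lower/upper into one "perc (lower, upper)" string
format_ci <- function(df) {
  data.frame(
    variable = df$variable,
    level    = df$level,
    # sprintf is vectorized, so no loop is needed; "%.2f" keeps two decimals
    ci.95    = sprintf("%.2f (%.2f, %.2f)", df$perc, df$lower, df$upper),
    stringsAsFactors = FALSE
  )
}

# Example rows copied from the output above
res <- data.frame(
  variable = c("cyl", "am"),
  level    = c("4", "0"),
  perc     = c(34.375, 59.375),
  lower    = c(19.49961, 40.94225),
  upper    = c(53.11130, 75.49765),
  stringsAsFactors = FALSE
)
formatted <- format_ci(res)
formatted
```

Using `sprintf` for all three numbers keeps trailing zeros, so every bound is printed with two decimals.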
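To illustrate the point about assign and get, here is a minimal sketch that keeps the per-level results in a named list instead of creating variables such as `prop1` and `se1` by name. The object names (`stats`, `lev`) are illustrative assumptions, not part of the original function.

```r
# Sketch: store per-level results in a named list instead of assign()/get()
x <- na.omit(mtcars$am)
n <- length(x)
lev <- sort(unique(x))

# One list element per level; setNames() names each element by its level
stats <- setNames(lapply(lev, function(l) {
  prop <- sum(x == l) / n        # proportion for this level
  se   <- sd(x == l) / sqrt(n)   # standard error for this level
  list(prop = prop, se = se)
}), lev)

# Look a result up by level name rather than by a constructed variable name
stats[["1"]]$prop
```

The lookup `stats[["1"]]` replaces `get(paste0("prop", i))`, and the whole structure can be passed around or inspected with `str(stats)`.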
Answer 1 (score: 0)
You can also keep the function intact and use lapply:
vars <- c("cyl", "am")
lapply(vars, t1.props, data=mtcars)
[[1]]
variable level ci.95
1 cyl 4 34.38 (19.50, 53.11)
2 cyl 6 21.88 (10.35, 40.45)
3 cyl 8 43.75 (27.10, 61.94)
[[2]]
variable level ci.95
1 am 0 59.38 (40.94, 75.5)
2 am 1 40.62 (24.50, 59.06)
And to combine them all into a single data frame:
lst <- lapply(vars, t1.props, data=mtcars)
do.call(rbind,lst)
Data
You have to simplify the var and var.name assignments so that the function takes the variable name as a character string:
t1.props <- function(x, data = NULL) {
  # Grab dataframe and/or variable name
  if(!missing(data)){
    var <- data[,x]
  } else {
    var <- x
  }
  # Grab variable name for use in output
  var.name <- x
  # Omit observations with missing data
  var.clean <- na.omit(var)
  # Number of nonmissing observations
  n <- length(var.clean)
  # Grab levels of variable
  levels <- sort(unique(var.clean))
  # Create an empty data frame to store values
  out <- data.frame(variable = NA,
                    level = NA,
                    ci.95 = NA)
  # Estimate prop, se, and ci for each level of the variable
  for(i in seq_along(levels)) {
    prop <- paste0("prop", i)
    se <- paste0("se", i)
    log.prop <- paste0("log.trans", i)
    log.se <- paste0("log.se", i)
    log.l <- paste0("log.l", i)
    log.u <- paste0("log.u", i)
    lcl <- paste0("lcl", i)
    ucl <- paste0("ucl", i)
    # Find the proportion for each level of the variable
    assign(prop, sum(var.clean == levels[i]) / n)
    # Find the standard error for each level of the variable
    assign(se, sd(var.clean == levels[i]) /
             sqrt(length(var.clean == levels[i])))
    # Perform a logit transformation of the original percentage estimate
    assign(log.prop, log(get(prop)) - log(1 - get(prop)))
    # Transform the standard error of the percentage to a standard error of its
    # logit transformation
    assign(log.se, get(se) / (get(prop) * (1 - get(prop))))
    # Calculate the lower and upper confidence bounds of the logit
    # transformation
    assign(log.l,
           get(log.prop) -
             qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    assign(log.u,
           get(log.prop) +
             qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    # Finally, perform inverse logit transformations to get the confidence bounds
    assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
    assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))
    # Create a combined 95% CI variable for easy copy/paste into Word tables
    ci.95 <- paste0(round(get(prop) * 100, 2), " ",
                    "(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
                    round(get(ucl) * 100, 2), ")")
    # Populate the "out" data frame with values
    out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
  }
  # Remove first (empty) row from out
  # But only in the first iteration
  if (is.na(out[1,1])) {
    out <- out[-1, ]
    rownames(out) <- 1:nrow(out)
  }
  out
}