编辑:对于低质量的帖子抱歉。我应该花更多的时间向你介绍这件事。该帖子已经过编辑,我已经为整个事情添加了一个有效的语法示例。感谢所有提供建议的人。
EDIT2:发现脚本在另一台计算机上运行缓慢。可能是由某些本地问题或REPL引起的。
我做了这个功能。它在标记的(标记的&避免包)数据帧中产生值的频率表。它有效,但我打算在有很多列的数据帧上使用它,我认为它运行有点慢;用户可能会认为R在运行超过100列时已经崩溃,所以我想加快速度。
此脚本的要点是生成输出,帮助我查找调查数据集中的处理错误。它有点繁琐,因为我想知道答案频率,同时评估值标签的形状。因此,此脚本为每个变量生成一个频率表,显示频率,未使用的标签和没有值标签的值。在查看脚本的输出时,这有望更清楚。
如果你能指出一些提高效率的方法,我将不胜感激:
# demonstration dataset
library(knitr)
library(data.table)
library(labelled)
df <- data.frame(q1 = rep(1:6, 3), q2 = rep(6:1, 3))
val_labels(df[, c("q1", "q2")]) <- c(YES = 1, MAYBE = 2, NO = 3, DK = 4, MISSING=5)
val_label(df$q2, 1) <- NULL
# Produce a frequency table over values and labels in a labelled-class dataframe object
# --------------------------------------------------------------------------------------------------
# Example: freqlab(ds[[1]]) or freqlab(ds[1:10]) or freqlab(ds)
# Wrong: freqlab(ds[1])
freqlab <- function(x){
# If the function is called on double brackets, eg. freqlab(ds[[11]])
if (!is.list(x)){
# Make a frequency distribution, put it in a data.table
xFreq <- data.table(table(x))
names(xFreq) <- c("Value", "Frequency")
class(xFreq[[1]]) <- "numeric"
setkey(xFreq, Value)
# Put the value labels in another data.table
if (!is.null(val_labels(x))){
xLab <- data.table(val_labels(x), names(val_labels(x)))
names(xLab) <- c("Value", "Label")
setkey(xLab, Value)
} else {
# If the variable does not have labels, create one to avoid errors
xLab <- data.table(xFreq[[1,1]], "** UNLABELLED **")
names(xLab) <- c("Value", "Label")
setkey(xLab, Value)
}
# Perform a FULL OUTER JOIN
outTable <- merge(xFreq, xLab, all = TRUE)
# Arrange values in ascending order of absolute value
outTable <- arrange(outTable, abs(outTable[[1]]))
# Edit the Label column for value cases with no label
outTable[[2]][is.na(outTable[[2]])] <- 0
outTable[[3]][is.na(outTable[[3]])] <- "** UNLABELLED **"
# If the output has more than 25 rows, cut it short
if (dim(outTable)[1] > 25){
outTable <- outTable[1:25]
}
# Output the table
print(kable(outTable, format = "rst", align = "l"))
# If the function is called on a list of variables, eg. freqlab(ds[10:11]),
# do the same steps as above, looping through all the input variables
} else {
for (y in 1:length(x)){
xFreq <- data.table(table(x[[y]]))
names(xFreq) <- c("Value", "Frequency")
class(xFreq[[1]]) <- "numeric"
setkey(xFreq, Value)
if (!is.null(val_labels(x[[y]]))){
xLab <- data.table(val_labels(x[[y]]), names(val_labels(x[[y]])))
names(xLab) <- c("Value", "Label")
setkey(xLab, Value)
} else {
xLab <- data.table(xFreq[[1,1]], "** UNLABELLED **")
names(xLab) <- c("Value", "Label")
setkey(xLab, Value)
}
outTable <- merge(xFreq, xLab, all = TRUE)
outTable <- arrange(outTable, abs(outTable[[1]]))
outTable[[2]][is.na(outTable[[2]])] <- 0
outTable[[3]][is.na(outTable[[3]])] <- "** UNLABELLED **"
if (dim(outTable)[1] > 25){
outTable <- outTable[1:25]
}
# Extra information printed when function is called on a list of variables
cat("Name:\t", names(x[y]),"\n")
print(kable(outTable, format = "rst", align = "l"))
cat(rep("-", 80), sep='', "\n\n")
}
}
}
输出示例:
> freqlab(df)
Name: q1
===== ========= ================
Value Frequency Label
===== ========= ================
1 3 YES
2 3 MAYBE
3 3 NO
4 3 DK
5 3 MISSING
6 3 ** UNLABELLED **
===== ========= ================
--------------------------------------------------------------------------------
Name: q2
===== ========= ================
Value Frequency Label
===== ========= ================
1 3 ** UNLABELLED **
2 3 MAYBE
3 3 NO
4 3 DK
5 3 MISSING
6 3 ** UNLABELLED **
===== ========= ================
--------------------------------------------------------------------------------
答案 0 :(得分:1)
没有玩具数据,更简单的代码以及对输入和输出的清晰解释,帮助您并不容易。无论如何,第一步通常是分析您的代码,以确定消耗时间的瓶颈。有关?Rprof
- 提供分析信息的函数,请参阅Rprof()
。
这个小例子说明了如何使用它:
square <- function (x) {
Sys.sleep(3)
return(x^2)
}
add <- function (x, y) {
Sys.sleep(1)
return(x + y)
}
complicatedFunction <- function(x, y) {
res <- square(add(square(x), square(y)))
return(res)
}
# Try to profile out "complicated" function
Rprof() # Start of profiling
res <- complicatedFunction(2, 5) # Function to profile
Rprof(NULL) # End of profiling
summaryRprof() # Show results
#$by.self
# self.time self.pct total.time total.pct
#"Sys.sleep" 9.54 100 9.54 100
#
#$by.total
# total.time total.pct self.time self.pct
#"Sys.sleep" 9.54 100.00 9.54 100
#"complicatedFunction" 9.54 100.00 0.00 0
#"square" 9.54 100.00 0.00 0
#"add" 6.58 68.97 0.00 0
#
#$sample.interval
#[1] 0.02
#
#$sampling.time
#[1] 9.54
在这里你可以看到在函数的被调用函数中花费了多长时间---在这个示例中Sys.sleep
明显占用了所有的时间。有关如何理解此输出的详细信息,请参阅?summaryRprof
。