使我的R功能更快

时间:2017-06-16 12:37:42

标签: r

编辑:对于低质量的帖子抱歉。我应该花更多的时间向你介绍这件事。该帖子已经过编辑,我已经为整个事情添加了一个有效的语法示例。感谢所有提供建议的人。

EDIT2:发现脚本在另一台计算机上运行缓慢。可能是由某些本地问题或REPL引起的。

我做了这个功能。它在标记的(标记的&避免包)数据帧中产生值的频率表。它有效,但我打算在有很多列的数据帧上使用它,我认为它运行有点慢;用户可能会认为R在运行超过100列时已经崩溃,所以我想加快速度。

此脚本的要点是生成输出,帮助我查找调查数据集中的处理错误。它有点繁琐,因为我想知道答案频率,同时评估值标签的形状。因此,此脚本为每个变量生成一个频率表,显示频率,未使用的标签和没有值标签的值。在查看脚本的输出时,这有望更清楚。

如果你能指出一些提高效率的方法,我将不胜感激:

# demonstration dataset
library(knitr)
library(data.table)
library(labelled)

df <- data.frame(q1 = rep(1:6, 3), q2 = rep(6:1, 3))
val_labels(df[, c("q1", "q2")]) <- c(YES = 1, MAYBE = 2, NO = 3, DK = 4, MISSING=5)
val_label(df$q2, 1) <- NULL

# Produce a frequency table over values and labels in a labelled-class dataframe object
# --------------------------------------------------------------------------------------------------
#    Example:    freqlab(ds[[1]])    or    freqlab(ds[1:10])    or    freqlab(ds)
#    Wrong:      freqlab(ds[1])

freqlab <- function(x){

    # If the function is called on double brackets, eg. freqlab(ds[[11]])
    if (!is.list(x)){

        # Make a frequency distribution, put it in a data.table
        xFreq <- data.table(table(x))
        names(xFreq) <- c("Value", "Frequency")
        class(xFreq[[1]]) <- "numeric"
        setkey(xFreq, Value)

        # Put the value labels in another data.table
        if (!is.null(val_labels(x))){
            xLab <- data.table(val_labels(x), names(val_labels(x)))
            names(xLab) <- c("Value", "Label")
            setkey(xLab, Value)
        } else {
            # If the variable does not have labels, create one to avoid errors
            xLab <- data.table(xFreq[[1,1]], "** UNLABELLED **")
            names(xLab) <- c("Value", "Label")
            setkey(xLab, Value)
        }

        # Perform a FULL OUTER JOIN 
        outTable <- merge(xFreq, xLab, all = TRUE)

        # Arrange values in ascending order of absolute value
        outTable <- arrange(outTable, abs(outTable[[1]]))

        # Edit the Label column for value cases with no label
        outTable[[2]][is.na(outTable[[2]])] <- 0
        outTable[[3]][is.na(outTable[[3]])] <- "** UNLABELLED **"

        # If the output has more than 25 rows, cut it short
        if (dim(outTable)[1] > 25){
            outTable <- outTable[1:25]
        }

        # Output the table
        print(kable(outTable, format = "rst", align = "l"))


    # If the function is called on a list of variables, eg. freqlab(ds[10:11]),
    #   do the same steps as above, looping through all the input variables
    } else {

        for (y in 1:length(x)){

            xFreq <- data.table(table(x[[y]]))
            names(xFreq) <- c("Value", "Frequency")
            class(xFreq[[1]]) <- "numeric"
            setkey(xFreq, Value)

            if (!is.null(val_labels(x[[y]]))){
                xLab <- data.table(val_labels(x[[y]]), names(val_labels(x[[y]])))
                names(xLab) <- c("Value", "Label")
                setkey(xLab, Value)
            } else {
                xLab <- data.table(xFreq[[1,1]], "** UNLABELLED **")
                names(xLab) <- c("Value", "Label")
                setkey(xLab, Value)
            }

            outTable <- merge(xFreq, xLab, all = TRUE)
            outTable <- arrange(outTable, abs(outTable[[1]]))
            outTable[[2]][is.na(outTable[[2]])] <- 0
            outTable[[3]][is.na(outTable[[3]])] <- "** UNLABELLED **"

            if (dim(outTable)[1] > 25){
                outTable <- outTable[1:25]
            }

            # Extra information printed when function is called on a list of variables
            cat("Name:\t", names(x[y]),"\n")
            print(kable(outTable, format = "rst", align = "l"))           
            cat(rep("-", 80), sep='', "\n\n")
        }
    }
}

输出示例:

> freqlab(df)
Name:    q1


=====  =========  ================
Value  Frequency  Label
=====  =========  ================
1      3          YES
2      3          MAYBE
3      3          NO
4      3          DK
5      3          MISSING
6      3          ** UNLABELLED **
=====  =========  ================
--------------------------------------------------------------------------------

Name:    q2


=====  =========  ================
Value  Frequency  Label
=====  =========  ================
1      3          ** UNLABELLED **
2      3          MAYBE
3      3          NO
4      3          DK
5      3          MISSING
6      3          ** UNLABELLED **
=====  =========  ================
--------------------------------------------------------------------------------

1 个答案:

答案 0 :(得分:1)

没有玩具数据,更简单的代码以及对输入和输出的清晰解释,帮助您并不容易。无论如何,第一步通常是分析您的代码,以确定消耗时间的瓶颈。有关?Rprof - 提供分析信息的函数,请参阅Rprof()

这个小例子说明了如何使用它:

square <- function (x) {
 Sys.sleep(3)
 return(x^2)
}

add <- function (x, y) {
 Sys.sleep(1)
  return(x + y)
}

complicatedFunction <- function(x, y) {
  res <- square(add(square(x), square(y)))
  return(res)
}

# Try to profile out "complicated" function
Rprof()  # Start of profiling
res <- complicatedFunction(2, 5)  # Function to profile
Rprof(NULL) # End of profiling
summaryRprof() # Show results
#$by.self
#            self.time self.pct total.time total.pct
#"Sys.sleep"      9.54      100       9.54       100
#
#$by.total
#                      total.time total.pct self.time self.pct
#"Sys.sleep"                 9.54    100.00      9.54      100
#"complicatedFunction"       9.54    100.00      0.00        0
#"square"                    9.54    100.00      0.00        0
#"add"                       6.58     68.97      0.00        0
#
#$sample.interval
#[1] 0.02
#
#$sampling.time
#[1] 9.54

在这里你可以看到在函数的被调用函数中花费了多长时间---在这个示例中Sys.sleep明显占用了所有的时间。有关如何理解此输出的详细信息,请参阅?summaryRprof