Efficiently replace data frame values with their cumulative frequencies

Time: 2012-10-23 18:46:05

Tags: r hash dataframe vectorization cumulative-frequency

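I am trying to write a program that takes a large data frame and replaces each column of values with the cumulative frequency of those values (sorted in ascending order). For example, if a column of values is: 5, 8, 3, 5, 4, 3, 8, 5, 5, 1, then the relative and cumulative frequencies are: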

  • 1: rel_freq = 0.1, cum_freq = 0.1
  • 3: rel_freq = 0.2, cum_freq = 0.3
  • 4: rel_freq = 0.1, cum_freq = 0.4
  • 5: rel_freq = 0.4, cum_freq = 0.8
  • 8: rel_freq = 0.2, cum_freq = 1.0
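
The table above can be reproduced with a few base R calls. This is only a minimal sketch for the example column; the names `x`, `rel_freq` and `freq` are illustrative and do not come from the original code.

```r
# Example values from the question
x <- c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1)

# table() counts each distinct value and orders the counts by ascending value;
# dividing by the number of observations gives the relative frequencies
rel_freq <- as.vector(table(x)) / length(x)

# Collect value, relative frequency and running total in one data frame
freq <- data.frame(
  value    = sort(unique(x)),
  rel_freq = rel_freq,
  cum_freq = cumsum(rel_freq)
)
print(freq)
```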

The original column then becomes: 0.8, 1.0, 0.3, 0.8, 0.4, 0.3, 1.0, 0.8, 0.8, 0.1

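The following code performs this operation correctly, but it scales poorly, probably because of the nested loops. Any idea how to perform this task more efficiently?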

mydata = read.table(.....)

totalcols = ncol(mydata)
totalrows = nrow(mydata)

for (i in 1:totalcols) {
    freqtable = data.frame(table(mydata[,i])/totalrows)  # create freq table
    freqtable$CumSum = cumsum(freqtable$Freq)   # calc cumulative freq

    hashtable = new.env(hash=TRUE)
    nrows = nrow(freqtable)

    # store cum freq in hash
    for (x in 1:nrows) {
        dummy = toString(freqtable$Var1[x])
        hashtable[[dummy]] = freqtable$CumSum[x]
    }

    # replace original data with cum freq
    for (j in 1:totalrows) {
        dummy = toString(mydata[j,i])
        mydata[j,i] = hashtable[[dummy]]
    }
}

2 answers:

Answer 0 (score: 2)

This handles a single column without a for loop:

R> x <- c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1)
R> y <- cumsum(table(x)/length(x))
R> y[as.character(x)]
  5   8   3   5   4   3   8   5   5   1 
0.8 1.0 0.3 0.8 0.4 0.3 1.0 0.8 0.8 0.1 
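
For numeric columns, a possible alternative (not part of this answer) is the base R function `ecdf()`, which builds the empirical cumulative distribution function of a vector. Evaluating that function at the original values should give the same cumulative frequencies. The sketch below assumes a purely numeric column and compares the result with the table lookup shown above.

```r
x <- c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1)

# ecdf(x) returns a step function F, where F(v) is the proportion of
# observations less than or equal to v; evaluating it at x itself
# yields the cumulative frequency of every observation
cum_freq <- ecdf(x)(x)

# Table-based lookup from the answer, with the names stripped for comparison
y <- cumsum(table(x) / length(x))
lookup <- unname(as.vector(y[as.character(x)]))

# Compare the two results up to floating-point tolerance
same <- isTRUE(all.equal(cum_freq, lookup))
print(same)
```

Unlike the name-based lookup, this variant does not convert the values to character strings, but it only applies to numeric data.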

Answer 1 (score: 1)

Here is one way, using a data frame with two variables, each containing your example data:

d <- data.frame(var1 = c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1),
                var2 = c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1))

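Use a simple function that

  1. generates the cumsum() of the relative proportions given by table(x) / length(x), then
  2. match()es the observations in the variable against the names of the cumulative sum table, then
  3. uses the ids from the match to select from (and unname) the cumulative sum table.

Such a function is: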

    f <- function(x) {
        tab <- cumsum(table(x) / length(x))
        ind <- match(x, as.numeric(names(tab)))
        unname(tab[ind])
    }
    

In practice, we use lapply() and coerce the result to a data frame:

    out <- data.frame(lapply(d, f))
    out
    

which gives:

    R> out
       var1 var2
    1   0.8  0.8
    2   1.0  1.0
    3   0.3  0.3
    4   0.8  0.8
    5   0.4  0.4
    6   0.3  0.3
    7   1.0  1.0
    8   0.8  0.8
    9   0.8  0.8
    10  0.1  0.1
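
If the goal is to overwrite the original data frame, as the loop in the question does, one option is to assign the `lapply()` result into `d[]`, which keeps `d` a data frame. The sketch below reuses `f` from this answer; the values in `var2` are made up for illustration so that the two columns differ.

```r
# var1 is the example column; var2 is hypothetical illustration data
d <- data.frame(var1 = c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1),
                var2 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5))

# Same function as in the answer above
f <- function(x) {
    tab <- cumsum(table(x) / length(x))       # cumulative relative frequencies
    ind <- match(x, as.numeric(names(tab)))   # position of each value in the table
    unname(tab[ind])                          # look up and drop the names
}

# Assigning into d[] replaces every column but keeps the data frame
# structure (column names and row names) intact
d[] <- lapply(d, f)
print(d)
```

One caveat: `f` matches values through the character names of the table, so it assumes every value survives the conversion to a string and back. Columns holding doubles with many significant digits may not match reliably.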