Question

I want to add a column that shows each row's share of the column total as a percentage. (Sorry for my poor math English.)

> trees['Heigth_%'] <- round((100 / sum(trees$Height) * trees$Height), digits=2)
> head(trees)
  Girth Height Volume Heigth_%
1   8.3     70   10.3     2.97
2   8.6     65   10.3     2.76
3   8.8     63   10.2     2.67
4  10.5     72   16.4     3.06
5  10.7     81   18.8     3.44
6  10.8     83   19.7     3.52

This works. But is it a good, R-like way to do it?

E.g., is sum() called for every row? Or is R smart enough here?

Answer 1

To answer your question of whether sum is called for every row or whether R is smart enough, you can use trace:

df = data.frame(a = 1:10, b = 21:30)
df['b_%'] = round((100 / sum(df$b) * df$b), digits=2)

trace('sum')
round((100 / sum(df$b) * df$b), digits=2)
untrace('sum')

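This shows only a single call to the sum function. After that, R sees that trees$Height and sum(trees$Height) have different lengths and recycles the shorter one until it matches the length of the longer one.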
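If you would rather not rely on trace, a small counting wrapper shows the same thing. This is only a sketch: height is a made-up stand-in for trees$Height, and counted_sum and calls are hypothetical helper names, not part of base R.

```r
# Hypothetical stand-in for trees$Height
height <- c(70, 65, 63, 72)

# Wrapper that counts how often sum() is actually evaluated
calls <- 0
counted_sum <- function(x) {
  calls <<- calls + 1
  sum(x)
}

pct <- round(100 / counted_sum(height) * height, digits = 2)

calls                      # 1: the argument is evaluated once, not once per row
length(100 / sum(height))  # 1: a length-one vector
length(pct)                # 4: the length-one value was recycled across height
```

The arithmetic operators are vectorized, so the single total is reused for every element instead of being recomputed.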

Answer 2

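First of all, Vandenman's answer is more complete and accurate. What follows isn't really worth an answer of its own but, as usual, it wouldn't be readable as a comment.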

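I've added the prop.table() and data.table() (see majom's answer) approaches. With 40k rows, data.table() gets closer to the rest but is still slower (about 3 ms versus about 3.7 ms); with 400k rows it becomes comparable; and with 4M rows it is finally faster than the rest: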

library(microbenchmark)
library(data.table)  # needed for as.data.table() and := below

# 4M rows, matching the length of Heigth_PCT and the timings below
trees <- data.frame(Height = runif(4000000, 9, 11),
                    Heigth_PCT = numeric(4000000))
trees_dt <- as.data.table(trees)

f1 <- function(trees) {
  trees$Heigth_PCT <- round((100 / sum(trees$Height) * trees$Height), digits = 2)
  return(trees)
}

f2 <- function(trees) {
  sum_trees <- sum(trees$Height)
  trees$Heigth_PCT <- round((100 / sum_trees * trees$Height), digits = 2)
  return(trees)
}

f3 <- function(trees) {
  trees$Heigth_PCT <- round(prop.table(trees$Height)*100, digits = 2)
  return(trees)
}

f4 <- function(trees_dt) {
  trees_dt[, Heigth_PCT := round(prop.table(Height)*100, 2)] 
}

# Time all four functions
microbenchmark(r1 <- f1(trees),
               r2 <- f2(trees),
               r3 <- f3(trees),
               r4 <- f4(trees_dt),
               times = 100)
# Unit: milliseconds
#               expr      min       lq     mean   median       uq      max neval cld
#    r1 <- f1(trees) 296.4452 309.3853 373.5945 318.7987 400.0373 639.8556   100   a
#    r2 <- f2(trees) 296.3453 310.6638 381.4048 323.0655 474.9295 682.2172   100   a
#    r3 <- f3(trees) 304.3185 317.0654 383.9600 328.5494 395.6238 783.2435   100   a
# r4 <- f4(trees_dt) 304.3327 315.4685 361.9526 325.8711 366.1153 722.7629   100   a

sapply(list(r2, r3, as.data.frame(r4)), identical, r1)
# [1] TRUE TRUE TRUE
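
As a quick sanity check on why f1 and f3 agree: for a plain vector, prop.table(x) is just x / sum(x). A minimal sketch on made-up values (height is a hypothetical stand-in for trees$Height):

```r
# Hypothetical stand-in for trees$Height
height <- c(70, 65, 63, 72)

# Manual version, as in f1
manual <- round(100 / sum(height) * height, digits = 2)

# prop.table() version, as in f3
via_prop <- round(prop.table(height) * 100, digits = 2)

# Both approaches should give the same percentages
stopifnot(isTRUE(all.equal(manual, via_prop)))
```

So prop.table() is mostly a readability choice here; the benchmark above suggests it makes no meaningful speed difference.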

R-like way to add a % column to a data frame

2 answers:

Edit: `prop.table()` added.

Edit 2: `data.table()` added.
