Create a table of means and (standard deviations) for multiple variables by group, formatted for publication

Time: 2017-02-02 06:13:03

Tags: r data.table knitr r-markdown aggregation

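I am learning R. I would like to generate a table of summary statistics for publication using simple, readable R code. The table should have the variables as rows and alternating mean and SD columns, with the two grouping variables spread across the columns as well. All values should be rounded to two decimal places, keeping trailing zeros (padding with zeros where necessary).

Using the mtcars dataset as an example, I would like the table to look like this (comparing cars with 4, 6, and 8 cylinders, automatic or manual):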

|     |4 0       |        |4 1       |        |6 0       |        |6 1       |        |8 0       |        |8 1       |        |
|:----|:---------|:-------|:---------|:-------|:---------|:-------|:---------|:-------|:---------|:-------|:---------|:-------|
|     |mean      |(SD)    |mean      |(SD)    |mean      |(SD)    |mean      |(SD)    |mean      |(SD)    |mean      |(SD)    |
|mpg  |22.90     |(1.45)  |28.07     |(4.48)  |19.12     |(1.63)  |20.57     |(0.75)  |15.05     |(2.77)  |15.40     |(0.57)  |
|disp |135.87    |(13.97) |93.61     |(20.48) |204.55    |(44.74) |155.00    |(8.66)  |357.62    |(71.82) |326.00    |(35.36) |
|hp   |84.67     |(19.66) |81.88     |(22.66) |115.25    |(9.18)  |131.67    |(37.53) |194.17    |(33.36) |299.50    |(50.20) |

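I wrote the following code, but I still need to create the first two header rows and add parentheses to the SD columns. To make the table suitable for publication, I am using R Markdown, knitr, and kable. Is there a simpler, more standard, or more idiomatic way to do this?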

```{r Create-Table-1}
library(data.table)
library(knitr)

mtcars_dt <- data.table(mtcars)
myGroups <- c("cyl", "am")
myVariables <- c("mpg", "disp", "hp")

means_dt <- mtcars_dt[,lapply(.SD, mean), .SDcols = myVariables, by = myGroups]
means_dt.melted <- melt.data.table(means_dt, id.vars = myGroups, measure.vars = myVariables)
means_dt.melted$stat <- "mean"

sd_dt <- mtcars_dt[,lapply(.SD, sd), .SDcols=myVariables, by=myGroups]
sd_dt.melted <- melt.data.table(sd_dt, id.vars = myGroups, measure.vars = myVariables)
sd_dt.melted$stat <- "sd" 

means_sd_merged_dt <- rbindlist(list(means_dt.melted, sd_dt.melted))
means_sd_dt <- dcast.data.table(means_sd_merged_dt, variable ~ cyl + am + stat, value.var = "value")

kable(means_sd_dt, digits = 2)

```

This is the table the code produces. The "8_1_mean" column is not rounded correctly. I tried pander, but it could not pad with zeros.

|variable | 4_0_mean| 4_0_sd| 4_1_mean| 4_1_sd| 6_0_mean| 6_0_sd| 6_1_mean| 6_1_sd| 8_0_mean| 8_0_sd| 8_1_mean| 8_1_sd|
|:--------|--------:|------:|--------:|------:|--------:|------:|--------:|------:|--------:|------:|--------:|------:|
|mpg      |    22.90|   1.45|    28.07|   4.48|    19.12|   1.63|    20.57|   0.75|    15.05|   2.77|     15.4|   0.57|
|disp     |   135.87|  13.97|    93.61|  20.48|   204.55|  44.74|   155.00|   8.66|   357.62|  71.82|    326.0|  35.36|
|hp       |    84.67|  19.66|    81.88|  22.66|   115.25|   9.18|   131.67|  37.53|   194.17|  33.36|    299.5|  50.20|
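
A likely explanation for the "8_1_mean" column, sketched in base R: after rounding, the values are formatted column by column, and `format()` keeps only as many decimals as the column needs. The three means in that column need just one decimal, so the trailing zero is dropped. The vector `col_8_1` below simply copies the values from the table above, and `padded` is an illustrative name.

```r
# Means from the "8_1_mean" column of the table above
col_8_1 <- c(15.40, 326.00, 299.50)

# format() picks the fewest decimals that represent the whole column
dropped <- trimws(format(round(col_8_1, 2)))
print(dropped)

# sprintf() with "%.2f" always writes exactly two decimals
padded <- sprintf("%.2f", col_8_1)
print(padded)
```

`kable()` also accepts a `format.args` list that it passes on to `format()`, so `kable(means_sd_dt, digits = 2, format.args = list(nsmall = 2))` may be enough here, although I have not verified it against this exact table.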

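Update: One of the main reasons I posted this question was to see whether there is a simpler way to make this kind of table, perhaps with other libraries, and to learn coding best practices.

That said, chinsoon12 provided a working answer, which I have incorporated into my first R function. I am posting it here so that others can modify and use it. It still has a bug that I have not been able to fix with digits and/or nsmall: sometimes one subgroup ends up with one more decimal digit than specified.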

```r
tabulatemsg <- function(variables, groups, input_dt, round_digits = 2, na.rm = FALSE) {
  # Create a table of alternating means and (SDs), for the specified variables, with groups as columns.
  require(data.table)

  # Aggregate means
  means_dt <- input_dt[,lapply(.SD, mean, na.rm = na.rm), .SDcols = variables, by = groups]
  means_dt.melted <- melt.data.table(means_dt, id.vars = groups, measure.vars = variables)
  means_dt.melted$stat <- "mean"

  # Aggregate standard deviations
  sd_dt <- input_dt[,lapply(.SD, sd, na.rm = na.rm), .SDcols=variables, by=groups]
  sd_dt.melted <- melt.data.table(sd_dt, id.vars = groups, measure.vars = variables)
  sd_dt.melted$stat <- "sd"

  # Merge and cast
  means_sd_merged_dt <- rbindlist(list(means_dt.melted, sd_dt.melted))
  means_sd_dt <- dcast.data.table(means_sd_merged_dt, paste("variable", 
    paste(c(groups, "stat"), collapse=" + "), sep=" ~ "), value.var = "value")

  # Ensure there are the specified number of digits after the decimal
  cols <- setdiff(names(means_sd_dt), "variable")
  means_sd_dt[, (cols) := lapply(.SD, format, digits=round_digits, nsmall=round_digits, justify="none"), .SDcols=cols]
  means_sd_dt[, (cols) := lapply(.SD, trimws), .SDcols=cols]

  # Add in parentheses
  cols <- names(means_sd_dt)[seq(3, ncol(means_sd_dt), by=2)]
  means_sd_dt[, (cols) := lapply(.SD, function(x) paste0("(", x, ")")), .SDcols=cols]

  # Add in second row
  output_table <- rbindlist(list(
    data.table(t(c("", rep(c("Mean", "(SD)"), (ncol(means_sd_dt)-1)/2)))),
    means_sd_dt), use.names=FALSE)

  # Rename the header row
  setnames(output_table, colnames(output_table), 
    gsub("variable", "", (gsub(" sd","", (gsub(" mean", "", (gsub("_"," ", colnames(means_sd_dt)))))))))

  return(output_table)
}
```
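
One plausible cause of the extra digit, offered as a sketch rather than a confirmed diagnosis: the `digits` argument of `format()` counts significant digits, not decimals, and it is applied to the column as a whole. If any value in a column is small enough to need three decimals to show two significant digits, every value in that column gets three decimals, and `nsmall` only sets a minimum. `formatC()` with `format = "f"` fixes the number of decimals instead. The helper `format_fixed` and the sample vector `x` are made up for illustration.

```r
# Hypothetical helper: always write exactly `decimals` digits after the point
format_fixed <- function(x, decimals = 2) {
  formatC(x, format = "f", digits = decimals)
}

# Made-up column containing one small value
x <- c(0.012, 12.5)

# digits = 2 means two significant digits, so 0.012 forces three decimals
by_format <- trimws(format(x, digits = 2, nsmall = 2))
print(by_format)

# Fixed-point formatting ignores the magnitude of the other values
by_formatC <- format_fixed(x)
print(by_formatC)
```

Under that assumption, replacing the `format` call inside `tabulatemsg` with something like `formatC(x, format = "f", digits = round_digits)` should give a constant number of decimals, and the `trimws` step would then no longer be needed.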

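1 Answer:

Answer 0 (score: 1)

You can use format to convert each column to character so that there are always 2 digits after the decimal point, and then add the parentheses: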

```r
#ensure there are 2 digits after decimal
cols <- setdiff(names(means_sd_dt), "variable")
means_sd_dt[, (cols) := lapply(.SD, format, digits=2, nsmall=2L, justify="none"), .SDcols=cols]
means_sd_dt[, (cols) := lapply(.SD, trimws), .SDcols=cols]

#add in parentheses
cols <- names(means_sd_dt)[seq(3, ncol(means_sd_dt), by=2)]
means_sd_dt[, (cols) := lapply(.SD, function(x) paste0("(", x, ")")), .SDcols=cols]

#add in first row
outputTbl <- rbindlist(list(
    data.table(t(c("", rep(c("mean", "(SD)"), (ncol(means_sd_dt)-1)/2)))),
    means_sd_dt), use.names=FALSE)

kable(outputTbl, digits = 2)
```
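
As a point of comparison for the "simpler way" part of the question, here is a base R sketch that formats the numbers while they are still numeric and only then assembles the table. It is a variation, not the layout requested above: each cell holds "mean (SD)" as one string instead of two separate columns. The names `vars`, `grp`, `fmt`, and `tab` are illustrative.

```r
vars <- c("mpg", "disp", "hp")

# One factor level per cyl/am combination, ordered "4 0", "4 1", "6 0", ...
grp <- interaction(mtcars$cyl, mtcars$am, sep = " ", lex.order = TRUE)

# Fixed two-decimal formatting, so trailing zeros are kept
fmt <- function(x) formatC(x, format = "f", digits = 2)

# One column per group; each cell is "mean (SD)"
tab <- sapply(split(mtcars[vars], grp), function(d) {
  paste0(fmt(colMeans(d)), " (", fmt(sapply(d, sd)), ")")
})
rownames(tab) <- vars

print(tab)
```

The resulting character matrix can be passed to `kable(tab)` as is. Because every cell is already a string, no further rounding or padding happens at the rendering step.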