Question

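I have a [1,758 x 38] data frame in which each row is a job posting and the columns are the skills required for that posting (skill1 through skill38). Most postings share many of the same skills, but the skills are listed in different columns. I would like to provide summary statistics for the required skills (for example, the most commonly required skills). I can generate this for a single column with data[, .N, keyby = skills1].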

But I have not been able to implement a loop that iterates over every column. How can I do this?

Answer 1

You can loop over the columns in base R with lapply. The output will be a list, with one frequency table per column.

lapply(data, table)

Or, as @thelatemail mentioned, the 'wide' format can be converted to a 'long' one with two columns, and table can then be applied to the result:

library(reshape2)
table(melt(as.matrix(data))[-1])

A similar approach with data.table is

library(data.table)
setDT(melt(as.matrix(data))[-1])[, .N, .(Var2, value)]

Or use mtabulate

library(qdapTools)
mtabulate(data)
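
Since the question asks for the most commonly required skills overall rather than per column, the columns can also be pooled before counting. The following is only a minimal base R sketch; the small data frame is a hypothetical stand-in for the real 1,758 x 38 table, and its column names and values are assumptions made for illustration.

```r
# Hypothetical toy data: 3 job postings x 3 skill columns
data <- data.frame(
  skill1 = c("R", "SQL", "Python"),
  skill2 = c("SQL", "R", "R"),
  skill3 = c("Python", NA, "R"),
  stringsAsFactors = FALSE
)

# Pool every column into one vector, then count each skill once overall.
# table() ignores NA by default, so empty skill slots are not counted.
skill_counts <- sort(table(unlist(data)), decreasing = TRUE)
print(skill_counts)

# Name of the most frequently required skill
top_skill <- names(skill_counts)[1]
print(top_skill)
```

Unlike lapply(data, table), this ignores which column a skill appears in, which seems to match the stated goal because the same skill can sit in different columns.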

Answer 2

I use sapply inside a wrapper function called sumstats, which produces the main summary statistics:

CV = function(x, ...) {sd(x, ...)/mean(x, ...)}
sumstats=function(x, ...) {
  mean.k=function(x) {if (is.numeric(x)) round(mean(x, ...), digits = 2)
                      else "N*N"}
  median.k=function(x) {  if (is.numeric(x)) round(median(x, ...), digits = 2)
                          else "N*N"}
  sd.k=function(x) {  if (is.numeric(x)) round(sd(x, ...), digits = 2)
                      else "N*N"}
  cv.k=function(x) {  if (is.numeric(x)) round(CV(x, ...), digits = 2)
                      else "N*N"}
  min.k=function(x) {  if (is.numeric(x)) round(min(x, ...), digits = 2)
                       else "N*N"}
  max.k=function(x) {  if (is.numeric(x)) round(max(x, ...), digits = 2)
                       else "N*N"}
  sumtable <- cbind(as.matrix(colSums(!is.na(x))), sapply(x,mean.k), sapply(x,median.k), sapply(x,sd.k),  sapply(x,cv.k), sapply(x,min.k), sapply(x,max.k))
  sumtable <- as.data.frame(sumtable);  names(sumtable) <- c("N.obs","Moy","Med","sd","CV", "min","max")
  return(sumtable)
}

The ... argument lets you pass na.rm = TRUE through to the underlying functions.
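
As a small illustration of how the ... pass-through behaves, here is a sketch that reuses only the CV helper from the function above; the vector x is made-up example data.

```r
# Coefficient of variation, forwarding extra arguments to sd() and mean()
CV <- function(x, ...) sd(x, ...) / mean(x, ...)

x <- c(1, 2, 3, NA)  # made-up data with one missing value

# Without na.rm the missing value propagates and the result is NA
cv_default <- CV(x)

# na.rm = TRUE is forwarded through ... to both sd() and mean()
cv_clean <- CV(x, na.rm = TRUE)  # sd = 1, mean = 2, so 0.5

print(c(cv_default, cv_clean))
```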

> head(sumstats(mtcars), 3)
###      N.obs    Moy    Med     sd   CV   min    max
### mpg     32  20.09  19.20   6.03 0.30 10.40  33.90
### cyl     32   6.19   6.00   1.79 0.29  4.00   8.00
### disp    32 230.72 196.30 123.94 0.54 71.10 472.00

Note: it does not work if you have only one column!
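
One likely reason for the single-column failure, offered as an assumption rather than something the answer states, is that selecting one column usually yields a plain vector, and colSums needs at least two dimensions. A minimal sketch of a possible workaround, using drop = FALSE to keep a one-column data frame:

```r
# Selecting one column with [ , j] drops to a plain vector by default
one_col_vec <- mtcars[, "mpg"]
print(is.data.frame(one_col_vec))

# colSums() needs at least two dimensions, so it errors on a vector
failed <- inherits(try(colSums(!is.na(one_col_vec)), silent = TRUE), "try-error")

# drop = FALSE keeps a one-column data frame, which colSums() accepts
one_col_df <- mtcars[, "mpg", drop = FALSE]
n_obs <- colSums(!is.na(one_col_df))
print(n_obs)
```

With this form, a call such as sumstats(mtcars[, "mpg", drop = FALSE]) should receive a data frame instead of a vector.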
