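Count occurrences of factor levels for multiple variables and summarize the results in one table

Time: 2014-09-20 19:59:27

Tags: r

This is my first post here, and I am new to both programming and R, so please excuse any silliness.

I have the following data frame: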

a <- data.frame("sickness1" = c(1,1,2,3,3,5,6, 4, 4, 4),
                "sickness2" = c(NA, NA, 3, 3, 4, 6, 1, 2, 5, 6),
                "sickness3" = c(NA, NA, 3, 4, 4, 6, 1, 2, 5, 6),
                "sickness4" = c(NA, NA, 6, 3, 4, 6, 1, 2, 5, 6))

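Each row represents one case. Each column is an ordered factor variable. I converted the variables to factors like this (using a tip I found on Stack Overflow!):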

a[] <- lapply(a, factor,
             levels = c(1:6),
             labels = c(3, 25, 50, 75, 97, 100))
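
One detail worth checking: factor() creates unordered factors unless ordered = TRUE is passed, so the call above produces plain factors. If the ordering is actually needed (for comparisons such as <), a minimal sketch on a made-up vector might look like this. The names x and f are hypothetical and not part of the data above.

```r
# Hypothetical toy vector, coded 1..6 like the columns in the question
x <- c(1, 3, 6, NA)

# Same levels and labels as in the question, plus ordered = TRUE
f <- factor(x,
            levels = 1:6,
            labels = c(3, 25, 50, 75, 97, 100),
            ordered = TRUE)

is.ordered(f)  # reports whether the factor carries an ordering
f[1] < f[2]    # comparisons are only meaningful for ordered factors
```

For counting with table() the distinction does not matter; the counts are the same either way.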

I would like to get the following output:

  percent   sickness1           sickness2    sickness3       sickness4
1       3          1                1            1            2
2      25          1                1            1            1
3      50          2                1            1            2
4      75          1                2            1            3
5      97          1                1            1            1
6     100          2                2            3            1

I have already found a very long-winded solution:

library(plyr)      # provides ldply() and count()
library(reshape2)  # provides dcast()

# count the occurrences of each level, column by column
ab <- ldply(lapply(a, count))

# reshape so that each variable becomes a column
ab2 <- dcast(
    data = ab,
    formula = x ~ .id,
    value.var = "freq")

# rename the first column
colnames(ab2)[1] <- "percent"

# delete row 7 because it contains the NA counts, which I don't want
ab2 <- ab2[-7,]
ab2

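Is there a quicker, simpler way? Perhaps using ddply somehow? The output of summary(a) is too messy, and I don't know how to manipulate it into the form I want. The real data I am working with is also much larger, and I have to do this kind of thing many times...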

3 Answers:

Answer 0 (score: 1)

OK, so I found that there are two possible solutions:

No. 1, by akrun:

un1 <- as.character(sort(unique(unlist(a, use.names = FALSE))))
data.frame(percent = un1,
           do.call(cbind,
                   lapply(a, function(x) table(factor(x, levels = un1)))))
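
To see why the factor(x, levels = un1) step matters, here is a minimal sketch on made-up data. The vector x and the level set lv are hypothetical and not taken from the question. Without a shared set of levels, table() only reports the values that occur in a column, so the per-column tables could have different lengths and would not line up under cbind.

```r
# Hypothetical toy column: the level "c" never occurs, and there is one NA
x  <- c("a", "b", "a", NA)
lv <- c("a", "b", "c")

# Only the values that actually occur are counted (NA is dropped by default)
plain <- table(x)

# Forcing a common set of levels keeps "c" in the result with a count of zero
padded <- table(factor(x, levels = lv))

plain
padded
```

Because every column is tabulated against the same levels, each table has the same length and the same row order.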

No. 2, by alexis_laz:

Given that I can easily make my data look like this (it is just the data frame above with an added column for the institution):

a <- data.frame("institution" = c(1:10), "sickness1" = c(1,1,2,3,3,5,6, 4, 4, 4),
                "sickness2" = c(NA, NA, 3, 3, 4, 6, 1, 2, 5, 6),
                "sickness3" = c(NA, NA, 3, 4, 4, 6, 1, 2, 5, 6),
                "sickness4" = c(NA, NA, 6, 3, 4, 6, 1, 2, 5, 6))

a[-1] <- lapply(a[-1], factor,
                levels = c(1:6),
                labels = c("0 to 3%","4-25%", "25-50%", "51-75%","76-97%","97-100%"))

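Then I can convert this wide data format to a long one like this:

library(reshape2)  # provides melt()
b2 <- melt(a, id.vars = "institution")

After that, the ordinary table function works:

table(b2[[3]], b2[[2]])

Note that the order of the arguments matters: the first one (the values) becomes the rows and the second one (the variables) becomes the columns.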
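
The effect of the argument order can be checked on a small long-format data frame. This is only an illustrative sketch: the data frame long and its contents are made up, and base R is used so that no extra package is needed.

```r
# Hypothetical long-format data: one row per (variable, value) observation
long <- data.frame(
  variable = c("s1", "s1", "s2", "s2", "s2"),
  value    = c("low", "high", "low", "low", "high")
)

# First argument -> rows, second argument -> columns
by_value    <- table(long$value, long$variable)  # rows = value, columns = variable
by_variable <- table(long$variable, long$value)  # rows = variable, columns = value

by_value
by_variable
```

Swapping the two arguments simply transposes the table, so the first layout is the one that matches the output asked for in the question.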

Thank you both very much!

Answer 1 (score: 1)

This is mostly a variation on the same theme as the other answers. Use stack and table together, like this:

as.data.frame.matrix(             ## converts the output to a data.frame
  table(                          ## does the actual tabulation
    stack(                        ## stack makes your data.frame long
      lapply(a, as.character)),   ## but won't work with factors; convert to char
    useNA = "no")                 ## we don't want NA values
)[levels(a[[1]]), ]               ## we want our rows in a nicer order
#     sickness1 sickness2 sickness3 sickness4
# 3           2         1         1         1
# 25          1         1         1         1
# 50          2         2         1         1
# 75          3         1         2         1
# 97          1         1         1         1
# 100         1         2         2         3

Alternatively, here is a "dplyr" + "tidyr" approach:

library(dplyr)
library(tidyr)

a %>% gather(var, val, sickness1:sickness4) %>%     ## make the data long
  mutate(val = factor(val, levels(unlist(a)))) %>%  ## refactor "val" column
  rev %>%                         ## reverse the order of val and var
  table %>%                       ## make your table
  as.data.frame.matrix            ## convert it to a data.frame

#     sickness1 sickness2 sickness3 sickness4
# 3           2         1         1         1
# 25          1         1         1         1
# 50          2         2         1         1
# 75          3         1         2         1
# 97          1         1         1         1
# 100         1         2         2         3

Answer 2 (score: 0)

You can try:

un1 <- as.character(sort(unique(unlist(a, use.names = FALSE))))
data.frame(percent = un1,
           do.call(cbind,
                   lapply(a, function(x) table(factor(x, levels = un1)))))
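
If every column is already a factor with the same levels (as after the lapply(a, factor, ...) step in the question), a shorter base R route may be enough. This is a sketch under that assumption rather than part of the answer above; the object names counts and result are hypothetical.

```r
# Data and factor conversion as given in the question
a <- data.frame(sickness1 = c(1, 1, 2, 3, 3, 5, 6, 4, 4, 4),
                sickness2 = c(NA, NA, 3, 3, 4, 6, 1, 2, 5, 6),
                sickness3 = c(NA, NA, 3, 4, 4, 6, 1, 2, 5, 6),
                sickness4 = c(NA, NA, 6, 3, 4, 6, 1, 2, 5, 6))
a[] <- lapply(a, factor, levels = 1:6, labels = c(3, 25, 50, 75, 97, 100))

# table() drops NA by default and keeps every factor level, so each column
# yields six counts and sapply() can bind them into a 6 x 4 matrix
counts <- sapply(a, table)

# Move the level labels from the row names into a "percent" column
result <- data.frame(percent = rownames(counts), counts, row.names = NULL)
result
```

If the columns did not share the same levels, sapply() would return a list instead of a matrix, which is the case the explicit levels = un1 step above guards against.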