How do I group all columns in a data.frame?

Time: 2018-06-14 05:31:18

Tags: r

I have the following data.frame in R:

  Introvert      Extrovert      Nature       Presence
     0              -1            3             Yes     
     1               3            2             No
     2               5            4             Yes
     1              -2            0             No

Now I want to recode the responses as follows:

    3,4 <- Positives
    0,1,2 <- Neutral
    < 0 <- Negatives

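Then I want the counts of Positives, Negatives, and Neutrals for each of Yes and No.
I have 20 response columns like the ones above. How can I do this with simple R code?

At the moment I run ifelse on each column and then group_by.

My desired data frame for the sample would be: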

         Introvert_Positive      Introvert_Negative     Introvert_Neutral

  Yes        0                         0                      2
  No         0                         0                      2  

2 Answers:

Answer 0: (score: 2)

How about this?

library(tidyverse);
df %>%
    gather(key, value, -Presence) %>%
    mutate(bin = cut(
        value,
        breaks = c(-Inf, -1, 2.5, Inf),
        labels = c("Negatives", "Neutral", "Positives"))) %>%
    select(-value) %>%
    unite(col, key, bin, sep = "_") %>%
    count(Presence, col) %>%
    spread(col, n)
## A tibble: 2 x 6
#  Presence Extrovert_Negativ… Extrovert_Positi… Introvert_Neutr… Nature_Neutral
#  <fct>                 <int>             <int>            <int>          <int>
#1 No                        1                 1                2              2
#2 Yes                       1                 1                2             NA
## ... with 1 more variable: Nature_Positives <int>

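Explanation: we recode the responses using cut with labels; the rest is a matter of gathering and uniting the relevant columns, counting occurrences with count, and reshaping from long to wide with spread.

Sample data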
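To make the recoding step concrete, here is a minimal base R sketch of how cut assigns values with these breaks. The vector x is made up for illustration. With the default right = TRUE each interval is closed on the right, so -1 lands in "Negatives" and 2 in "Neutral". This matches the question's coding as long as the responses are integers; a non-integer such as -0.5 would be labelled "Neutral", so the breaks would need adjusting for such data.

```r
# Illustrative values only (not the asker's data)
x <- c(-2, -1, 0, 2, 3, 5)

bins <- cut(
  x,
  breaks = c(-Inf, -1, 2.5, Inf),   # intervals (-Inf,-1], (-1,2.5], (2.5,Inf)
  labels = c("Negatives", "Neutral", "Positives")
)

# Pair each value with its label
data.frame(x = x, bin = bins)

# A value strictly between -1 and 0 falls into the middle interval
frac_bin <- as.character(cut(-0.5, breaks = c(-Inf, -1, 2.5, Inf),
                             labels = c("Negatives", "Neutral", "Positives")))
```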

df <- read.table(text =
    "Introvert      Extrovert      Nature       Presence
     0              -1            3             Yes
     1               3            2             No
     2               5            4             Yes
     1              -2            0             No", header = TRUE)

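Answer 1: (score: 1)

For fun/practice, I built a data.table version that follows the workflow of @MauritsEvers' answer. On the sample data it takes roughly 60% less time than the dplyr approach (see the benchmark).

data.table

You can skip the unite of the key column (called variable after melt) and bin, because dcast can handle that in the same step as the cast.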

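library(data.table)
library(magrittr)   # provides %>%

df %>% 
  setDT() %>%                       # convert to data.table by reference
  melt( id.vars = "Presence" ) %>%  # wide to long: columns variable and value
  .[, bin := cut( value, 
                  breaks = c(-Inf, -1, 2.5, Inf),
                  labels = c("Negatives", "Neutral", "Positives") )] %>%
  .[, value := NULL] %>%
  .[, .N, by = c("Presence", "variable", "bin")] %>%   # count per group
  dcast( Presence ~ variable + bin, value.var = "N")   # long to wide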



Presence Introvert_Neutral Extrovert_Negatives Extrovert_Positives Nature_Neutral Nature_Positives
1:       No                 2                   1                   1              2               NA
2:      Yes                 2                   1                   1             NA                2
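The point about dcast doing the "unite" itself can be seen in isolation. The sketch below is only an illustration: the counts table is typed in by hand (a subset of the counts shown above), and fill = 0L is an addition that is not part of the answer. It replaces the NA cells with 0, which is closer to the zeros in the asker's desired output.

```r
library(data.table)

# Hand-typed long-format counts, shaped like the result of the .N step
counts <- data.table(
  Presence = c("No", "No", "Yes", "Yes"),
  variable = c("Nature", "Extrovert", "Nature", "Extrovert"),
  bin      = c("Neutral", "Negatives", "Positives", "Negatives"),
  N        = c(2L, 1L, 2L, 1L)
)

# Column names are built from variable and bin joined by "_" (the default sep);
# fill = 0L puts 0 into combinations that never occurred
wide <- dcast(counts, Presence ~ variable + bin, value.var = "N", fill = 0L)
wide
```

Columns for combinations that never occur for any Presence value (for example Introvert_Negatives in the sample data) are still absent, since dcast only creates columns for combinations present in the long table.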

Benchmark

library(microbenchmark)
microbenchmark(
  dplyr = {
    df %>%
      gather(key, value, -Presence) %>%
      mutate(bin = cut(
        value,
        breaks = c(-Inf, -1, 2.5, Inf),
        labels = c("Negatives", "Neutral", "Positives"))) %>%
      select(-value) %>%
      unite(col, key, bin, sep = "_") %>%
      count(Presence, col) %>%
      spread(col, n)
  },
  data.table = {
    df %>% 
      setDT() %>%
      melt( id = 4 ) %>%
      .[, bin := cut( value, 
                      breaks = c(-Inf, -1, 2.5, Inf),
                      labels = c("Negatives", "Neutral", "Positives") )] %>%
      .[, value := NULL] %>%
      .[, .N, by = c("Presence", "variable", "bin")] %>% 
      dcast( Presence ~ variable + bin, value.var = "N")
  },
  times = 1000
)

Unit: milliseconds
       expr      min        lq     mean    median        uq      max neval
      dplyr 9.636224 10.083903 10.59597 10.267371 10.458524 26.38649  1000
 data.table 3.458208  3.647401  3.92219  3.835239  3.949568 15.05596  1000