Difference between values in different columns that satisfy a given condition

Time: 2018-11-13 16:37:57

Tags: r tidyverse

Here is my toy data. I have val and the quartile variables q0 through q4.

 df <- tibble::tribble(
      ~val, ~q0, ~q1, ~q2,  ~q3, ~q4, ~q, ~diff,
       15L, 15L, 15L, 15L,   15, 15L, 4L,     0,
       17L,  2L, 16L, 30L,   34, 54L, 2L,    13,
       29L,  2L, 16L, 30L,   34, 54L, 2L,     1,
       25L,  2L, 17L, 20L,   26, 43L, 3L,     1 )

I need to compute the last two variables, q and diff, such that:

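  1. When val lies between q1 and q2, I pick 2 for q (as in q2); see the 2nd row.
  2. If there is a tie, I pick the largest of the qs (e.g. q = 4 in the first row).
  3. diff is the difference between the chosen quartile value and val. So for row 1 it is q4 - val = 0, and for row 2 it is q2 - val = 30 - 17 = 13.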
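
The arithmetic in rule 3 can be checked by hand for all four rows. This is only a sanity check of the expected output, not a solution: the chosen quartile values are typed in manually, and the vector name `chosen` is made up for this illustration.

```r
# Values copied by hand from the toy data
val    <- c(15, 17, 29, 25)
chosen <- c(15, 30, 30, 26)   # q4, q2, q2, q3 for rows 1 to 4

# Rule 3: diff is the chosen quartile value minus val
chosen - val
```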

How can I compute q and diff in R, preferably with the tidyverse? Perhaps we can build on the answer here: Extract column name and specific value based on a condition

2 Answers:

Answer 0: (score: 2)

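When you have more complex logic like this, I find it is usually best to wrap it in a function. It will be easier to maintain, read, and debug down the road. I am also extra careful with lots of nested ifelse statements or big case_when-type constructs. In the accepted answer, q can only be 2, 3, or 4. There is no case in which q comes out as 1, and you will surely want that as an option in the final product.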

df <- tibble::tribble(
~val, ~q0, ~q1, ~q2,  ~q3, ~q4, ~q, ~diff,
15L, 15L, 15L, 15L,   15, 15L, 4L,     0,
17L,  2L, 16L, 30L,   34, 54L, 2L,    13,
29L,  2L, 16L, 30L,   34, 54L, 2L,     1,
25L,  2L, 17L, 20L,   26, 43L, 3L,     1 )

whichQ <- function(df, qs = c('q0', 'q1', 'q2', 'q3', 'q4')) {
    # This has the flexibility of changing your column names / using more or less Q splits
    qDf <- df[, qs]
    # This finds the right quantile by finding how many you are larger than
    # It works because the q's are sequential
    whichGreater <- df$val >= qDf
    q <- apply(whichGreater, 1, sum)
    # 4 is a special case because there is no next quantile
    q <- ifelse(q == 5, 4, q)
    df$q <- q
    # Go through the Qs we found and grab the value of that column
    diff <- sapply(seq_along(q), function(x) {
        as.integer(qDf[x, q[x]+1])
    })
    # Get the difference
    df$diff <- diff - df$val
    df
}

You can still use it in a tidyverse pipeline, and (I think) it reads much more clearly as long as you give the function a useful name.

library(magrittr)  # provides the %>% pipe

df %>% 
    whichQ %>% 
    head(2)
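
The same counting idea can also be written without `apply` and `sapply`. The following is a minimal sketch, not the original answer's code: `whichQVec` is a made-up name, and the toy data is rebuilt as a plain data.frame so the block runs with base R only. It relies on the same assumption as above, namely that the q columns are in increasing order.

```r
# Toy data as a plain data.frame so the sketch runs without extra packages
df <- data.frame(
  val = c(15, 17, 29, 25),
  q0  = c(15,  2,  2,  2),
  q1  = c(15, 16, 16, 17),
  q2  = c(15, 30, 30, 20),
  q3  = c(15, 34, 34, 26),
  q4  = c(15, 54, 54, 43)
)

# whichQVec is a hypothetical name, not part of the original answer
whichQVec <- function(df, qs = c("q0", "q1", "q2", "q3", "q4")) {
  qMat <- as.matrix(df[, qs])
  # Count how many quantile columns each val is greater than or equal to
  q <- rowSums(df$val >= qMat)
  # Cap at the last quantile because there is no next column to pick
  q <- pmin(q, length(qs) - 1)
  df$q <- q
  # Matrix indexing: in each row, pick the value in column q + 1
  df$diff <- qMat[cbind(seq_len(nrow(df)), q + 1)] - df$val
  df
}

res <- whichQVec(df)
res[, c("val", "q", "diff")]
```

On the toy data this gives q = 4, 2, 2, 3 and diff = 0, 13, 1, 1, matching the columns in the question.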

Answer 1: (score: 1)

Try:

library(tidyverse)
df <- tribble(
        ~val, ~q0, ~q1, ~q2,  ~q3, ~q4,
        15L, 15L, 15L, 15L,   15, 15L,
        17L,  2L, 16L, 30L,   34, 54L,
        29L,  2L, 16L, 30L,   34, 54L,
        25L,  2L, 17L, 20L,   26, 43L)

df %>%
        mutate(q = ifelse(val > q1 & val < q2, 2,
                          ifelse(val == q0 & val == q1 & val == q2 & val == q3 & val == q4, 4,
                                 3)),
               diff = ifelse(val > q1 & val < q2, q2 - val,
                             ifelse(val == q0 & val == q1 & val == q2 & val == q3 & val == q4, q4 - val,
                                    q3 - val)))
# A tibble: 4 x 8
    val    q0    q1    q2    q3    q4     q  diff
  <int> <int> <int> <int> <dbl> <int> <dbl> <dbl>
1    15    15    15    15    15    15     4     0
2    17     2    16    30    34    54     2    13
3    29     2    16    30    34    54     2     1
4    25     2    17    20    26    43     3     1

Using case_when (assuming you pick 3 when val is between q2 and q3):

df %>%
        mutate(q = case_when(val > q1 & val < q2  ~ 2,
                             val == q0 & val == q1 & val == q2 & val == q3 & val == q4 ~ 4,
                             val > q2 & val < q3 ~ 3),
               diff = case_when(val > q1 & val < q2 ~ q2 - val,
                                val == q0 & val == q1 & val == q2 & val == q3 & val == q4 ~ q4 - val,
                                val > q2 & val < q3 ~ as.integer(q3 - val)))
# A tibble: 4 x 8
    val    q0    q1    q2    q3    q4     q  diff
  <int> <int> <int> <int> <dbl> <int> <dbl> <int>
1    15    15    15    15    15    15     4     0
2    17     2    16    30    34    54     2    13
3    29     2    16    30    34    54     2     1
4    25     2    17    20    26    43     3     1
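
One thing to keep in mind with the case_when version: a row that matches none of the conditions gets NA rather than a fallback value. The sketch below illustrates this with two made-up rows that are not part of the question's data: one with val below q1 and one with val exactly equal to q2.

```r
library(dplyr)

# Hypothetical extra rows, not from the original data:
# row 1 has val below q1, row 2 has val exactly equal to q2
extra <- tibble::tribble(
  ~val, ~q0, ~q1, ~q2, ~q3, ~q4,
   10L,  2L, 16L, 30L, 34L, 54L,
   30L,  2L, 16L, 30L, 34L, 54L
)

res <- extra %>%
  mutate(q = case_when(val > q1 & val < q2 ~ 2,
                       val == q0 & val == q1 & val == q2 & val == q3 & val == q4 ~ 4,
                       val > q2 & val < q3 ~ 3))

# No condition matches either row, so q is NA for both
res$q
```

If such rows can occur in the real data, the conditions would need further branches (or non-strict comparisons) to cover them.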