How can I conditionally select the top value per group without comparing every value?

Date: 2020-10-30 11:34:37

Tags: r if-statement data.table

My data looks like this:

  Group Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5
    1   CLNS1A   0.2811747       0               2
    1   RSF1     0.5469924       3               6
    2   CFDP1    0.4186066       1               2
    2   CHST6    0.4295135       1               3
    3   ACE      0.634           1               1
    3   NOS2     0.6345          1               1
    4   Gene1    0.7             0               1
    4   Gene2    0.61            1               0
    4   Gene3    0.62            0               1          

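I am grouping the genes by the Group column and then selecting the best gene(s) in each group according to these conditions:

  1. If the top-scoring gene differs from every other gene in the group by more than 0.05, select the top-scoring gene.

  2. If the difference between the top-scoring gene and any other gene in the group is <0.05, select the gene with the higher direct_count, choosing only among the genes that are within <0.05 of the top-scoring gene in that group.

  3. If direct_count is the same, select the gene with the highest secondary_count.

  4. If all counts are the same, select all genes that are within <0.05 of each other.

Example output looks like this: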

 Group Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5  #highest direct_count
    2   CHST6    0.4295135       1               3  #highest secondary_count after matching direct_count
    3   ACE      0.634           1               1  #ACE and NOS2 have matching counts
    3   NOS2     0.6345          1               1
    4   Gene1    0.7             0               1  #highest score by >0.05 difference

Currently, I am trying to code this with:

df<- setDT(df)
new_df <- df[, 
   {d = dist(Score, method = 'manhattan')
   if (any(d > 0.05)) 
     ind = which.max(d)
   else if (sum(max(direct_count) == direct_count) == 1L) 
     ind = which.max(direct_count)
   else if (sum(max(secondary_count) == secondary_count) == 1L) 
     ind = which.max(secondary_count)
   else 
     ind = which((outer(direct_count, direct_count, '==') & outer(secondary_count, secondary_count, '=='))[1, ])
   
   .SD[ind]
   }
   , by = Group]

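However, I am struggling to adjust my first else if statement to handle my second condition, that is, to choose only among genes within <0.05 of the top-scoring gene. At the moment it compares all genes in the group. For example, if the top-scoring gene is at 0.7 and another gene at 0.68 satisfies the <0.05 distance requirement, a gene with a score of 0.1 but the largest count would still be selected over the top-scoring gene.

Essentially, I want conditions 2 to 4 to consider only the genes within <0.05 of the top-scoring gene in each group.

Input data: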

structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1","Gene2","Gene3"), Score = c(0.5566507, 
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.7, 0.62, 0.61), direct_count = c(4L, 
0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L, 0L, 0L, 1L)), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

Edit:

The reason I am asking is that one particular group is not behaving as I expected:

  Group Gene         Score      direct_count     secondary_count
1   2    CFDP1        0.5517401        1                  62
2   2    CHST6        0.5989186        1                   6
3   2    RNU6-758P    0.5644914        0                   1
4   2    Gene1        0.5672916        0                   1
5   2    TMEM170A     0.6167083        0                   2

CHST6 has the highest direct_count of all genes within <0.05 of the top-scoring gene in this group, yet Gene1 is selected.

Second example input data:

structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1", 
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502, 
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62, 
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

1 Answer:

Answer 0 (score: 1)

You can reach your goal with either of two solutions: dplyr or data.table.

You do not need any complex if/else conditions.

Solution

Input

dt <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
                     Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
                     Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345),
                     direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L),
                     secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L)),
                row.names = c(NA, -7L),
                class = c("data.table", "data.frame"))

DPLYR

library(dplyr)

dt %>% 
  group_by(Group) %>% 
  filter((max(Score) - Score)<0.05) %>% 
  slice_max(direct_count, n = 1) %>% 
  slice_max(secondary_count, n = 1) %>% 
  ungroup()
#> # A tibble: 4 x 5
#>   Group Gene  Score direct_count secondary_count
#>   <int> <chr> <dbl>        <int>           <int>
#> 1     1 AQP11 0.557            4               5
#> 2     2 CHST6 0.430            1               3
#> 3     3 ACE   0.634            1               1
#> 4     3 NOS2  0.634            1               1
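
Both genes in Group 3 survive because slice_max() keeps ties by default (with_ties = TRUE), which is what rule 4 asks for. A minimal sketch on a two-row tibble, where g3 is a hypothetical name and the values are copied from Group 3:

```r
library(dplyr)

# Hypothetical two-row example mirroring Group 3 (tied counts).
g3 <- tibble(Gene = c("ACE", "NOS2"),
             direct_count = c(1L, 1L),
             secondary_count = c(1L, 1L))

# Default behaviour: tied rows are all kept.
ties_kept <- slice_max(g3, direct_count, n = 1)

# With with_ties = FALSE only a single row is returned.
ties_dropped <- slice_max(g3, direct_count, n = 1, with_ties = FALSE)

nrow(ties_kept)
nrow(ties_dropped)
```

If you ever want exactly one gene per group, with_ties = FALSE is the switch to look at, but for the rules as written the default is the right choice.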

DATA.TABLE

library(data.table)

dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#>    Group  Gene     Score direct_count secondary_count
#> 1:     1 AQP11 0.5566507            4               5
#> 2:     2 CHST6 0.4295135            1               3
#> 3:     3   ACE 0.6340000            1               1
#> 4:     3  NOS2 0.6345000            1               1
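
The three steps above overwrite dt each time. If you would rather keep the input intact, the same filters can probably be applied inside one grouped call. This is only a sketch: pick_best and genes_dt are hypothetical names, and the data is the question's full input, including Group 4, which the snippet above leaves out.

```r
library(data.table)

# Full input from the question (Groups 1 to 4).
genes_dt <- data.table(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2",
           "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.7, 0.62, 0.61),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 0L, 0L, 1L)
)

# Hypothetical helper: apply the three filters to the rows of one group.
pick_best <- function(sd) {
  # Keep only genes within 0.05 of the group's top score.
  cand <- sd[(max(Score) - Score) < 0.05]
  # Among those, keep the highest direct_count (ties kept).
  cand <- cand[direct_count == max(direct_count)]
  # Among those, keep the highest secondary_count (ties kept).
  cand[secondary_count == max(secondary_count)]
}

# Run the helper once per group; genes_dt itself is not modified.
res <- genes_dt[, pick_best(.SD), by = Group]
res
```

For Group 4 only Gene1 lies within 0.05 of the top score, so it is returned without the counts ever being compared.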

Your edit

Regarding the specific problem at the end of your question: both approaches select CHST6, as you would expect from the rules you wrote.

dt <- structure(list(Group = c(2L, 2L, 2L, 2L, 2L), 
               Gene = c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A"), 
               Score = c(0.551740109920502,  0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006),
               direct_count = c(1, 1, 0, 0, 0), 
               secondary_count = c(62, 6, 1, 1, 2)), 
          row.names = c(NA, -5L), 
          class = c("data.table", 
                    "data.frame"))


########## DPLYR

library(dplyr)

dt %>% 
  group_by(Group) %>% 
  filter((max(Score) - Score)<0.05) %>% 
  slice_max(direct_count, n = 1) %>% 
  slice_max(secondary_count, n = 1) %>% 
  ungroup()
#> # A tibble: 1 x 5
#>   Group Gene  Score direct_count secondary_count
#>   <int> <chr> <dbl>        <dbl>           <dbl>
#> 1     2 CHST6 0.599            1               6


########## DATATABLE

library(data.table)

dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#>    Group  Gene     Score direct_count secondary_count
#> 1:     2 CHST6 0.5989186            1               6
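
As for why the attempt in the question returned Gene1 for this group, one likely explanation (based only on reading the posted code) is that which.max(d) indexes into the vector of pairwise distances that dist() returns, not into the rows of the group. The sketch below uses only base R; genes, score, pair_pos and candidates are names made up for the illustration.

```r
# Gene names and scores copied from the second example in the question.
genes <- c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A")
score <- c(0.551740109920502, 0.598918557167053, 0.564491391181946,
           0.567291617393494, 0.616708278656006)

# dist() returns n * (n - 1) / 2 pairwise distances, not one value per row.
d <- dist(score, method = "manhattan")
length(d)                      # 10 pairs for 5 genes

# Position of the largest distance within that vector of pairs:
# here it is the 4th pair, (CFDP1, TMEM170A).
pair_pos <- which.max(d)

# Using the pair position as a row index picks an unrelated row.
genes[pair_pos]                # "Gene1"

# Restricting to genes within 0.05 of the top score instead.
candidates <- genes[(max(score) - score) < 0.05]
candidates                     # "CHST6" "Gene1" "TMEM170A"
```

Comparing each score with max(Score) directly, as both solutions above do, avoids the pairwise distances altogether.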