My data looks like this:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
4 Gene1 0.7 0 1
4 Gene2 0.61 1 0
4 Gene3 0.62 0 1
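I am grouping the genes by the Group column and then selecting the best gene in each group according to these conditions:
If the highest-scoring gene differs in score from every other gene in its group by more than 0.05, select the highest-scoring gene.
If the score difference between the top gene and any other gene in the group is <0.05, select the gene with the higher direct_count, choosing only among the genes within <0.05 of the highest-scoring gene in that group.
If direct_count is tied, select the gene with the highest secondary_count.
If all counts are tied, select all genes that are within <0.05 of each other.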
Example output:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5 #highest direct_count
2 CHST6 0.4295135 1 3 #highest secondary_count after matching direct_count
3 ACE 0.634 1 1 #ACE and NOS2 have matching counts
3 NOS2 0.6345 1 1
4 Gene1 0.7 0 1 #highest score by >0.05 difference
Currently I am trying to code this with:
df <- setDT(df)
new_df <- df[,
  {
    d = dist(Score, method = 'manhattan')
    if (any(d > 0.05))
      ind = which.max(d)
    else if (sum(max(direct_count) == direct_count) == 1L)
      ind = which.max(direct_count)
    else if (sum(max(secondary_count) == secondary_count) == 1L)
      ind = which.max(secondary_count)
    else
      ind = which((outer(direct_count, direct_count, '==') & outer(secondary_count, secondary_count, '=='))[1, ])
    .SD[ind]
  }
  , by = Group]
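However, I am struggling to adapt my first else if statement to handle my second condition, that is, to choose only among the genes within <0.05 of the highest-scoring gene. At the moment it compares all genes in the group. So if, for example, one gene in the group has a score of 0.1 but the largest count columns, it is selected over the highest-scoring gene (0.7) whenever another gene in the group has a score of, say, 0.68 and therefore satisfies the <0.05 distance requirement.
Essentially, I want conditions 2 to 4 to consider only the genes within <0.05 of the highest-scoring gene in each group.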
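The restriction described above amounts to a logical mask over each group's scores. Below is a minimal base R sketch using the hypothetical scores 0.7, 0.68 and 0.1 from the example; the direct_count values are made up for illustration and are not taken from the real data.

```r
# Hypothetical group: top gene 0.7, a close gene 0.68, and a distant gene 0.1
# that happens to have the largest count (counts are illustrative assumptions).
score        <- c(0.7, 0.68, 0.1)
direct_count <- c(0L, 1L, 5L)

# TRUE for genes within 0.05 of the group's highest score
in_window <- (max(score) - score) < 0.05

# Compare counts only inside the window, so the 0.1 gene can never win
candidates <- which(in_window)
best <- candidates[direct_count[candidates] == max(direct_count[candidates])]
best
```

Here only the first two genes fall inside the window, so the count comparison never sees the 0.1 gene.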
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1","Gene2","Gene3"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.7, 0.62, 0.61), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L, 0L, 0L, 1L)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Edit:
The reason I am asking is that one particular group does not behave as I expect:
Group Gene Score direct_count secondary_count
1 2 CFDP1 0.5517401 1 62
2 2 CHST6 0.5989186 1 6
3 2 RNU6-758P 0.5644914 0 1
4 2 Gene1 0.5672916 0 1
5 2 TMEM170A 0.6167083 0 2
CHST6 has the highest direct_count of all the genes within <0.05 of the highest-scoring gene in this group, yet Gene1 is still selected.
Second example input data:
structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1",
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502,
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62,
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
Answer 0 (score: 1)
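You can reach your goal with either of two solutions: dplyr or data.table.
You don't need any complicated ifelse conditions. Three successive per-group filters are enough: keep the genes within 0.05 of the top score, then the highest direct_count, then the highest secondary_count.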
Input
dt <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
DPLYR
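library(dplyr)
dt %>%
  group_by(Group) %>%
  # keep only genes within 0.05 of the group's highest score
  filter((max(Score) - Score) < 0.05) %>%
  # among those, keep the highest direct_count (ties are kept)
  slice_max(direct_count, n = 1) %>%
  # break remaining ties by secondary_count (ties are kept again)
  slice_max(secondary_count, n = 1) %>%
  ungroup()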
#> # A tibble: 4 x 5
#> Group Gene Score direct_count secondary_count
#> <int> <chr> <dbl> <int> <int>
#> 1 1 AQP11 0.557 4 5
#> 2 2 CHST6 0.430 1 3
#> 3 3 ACE 0.634 1 1
#> 4 3 NOS2 0.634 1 1
DATA.TABLE
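library(data.table)
# .I gives row numbers in dt; each step keeps the rows that pass within their Group
# 1) genes within 0.05 of the group's highest score
dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
# 2) highest direct_count among those
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
# 3) highest secondary_count among the remaining ties
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt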
#> Group Gene Score direct_count secondary_count
#> 1: 1 AQP11 0.5566507 4 5
#> 2: 2 CHST6 0.4295135 1 3
#> 3: 3 ACE 0.6340000 1 1
#> 4: 3 NOS2 0.6345000 1 1
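The input used above leaves out Group 4 from the question. As a rough check, here is a base R sketch of the same three filters applied to the question's full input data. The helper name pick_best and its tol argument are hypothetical, chosen only for this illustration.

```r
# Full input from the question, including Group 4
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6",
           "ACE", "NOS2", "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.7, 0.62, 0.61),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# pick_best() is a hypothetical helper, not part of any package
pick_best <- function(g, tol = 0.05) {
  g <- g[(max(g$Score) - g$Score) < tol, ]          # score window
  g <- g[g$direct_count == max(g$direct_count), ]   # best direct_count
  g[g$secondary_count == max(g$secondary_count), ]  # best secondary_count
}

# Apply the helper to each group and stack the results
result <- do.call(rbind, lapply(split(df, df$Group), pick_best))
rownames(result) <- NULL
result$Gene
#> [1] "AQP11" "CHST6" "ACE"   "NOS2"  "Gene1"
```

Group 4 keeps only Gene1 because the other two genes are more than 0.05 below its score, which matches the example output in the question.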
Regarding the specific problem at the end of your question: both methods select CHST6, as you would expect from the rules you wrote.
dt <- structure(list(Group = c(2L, 2L, 2L, 2L, 2L),
Gene = c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A"),
Score = c(0.551740109920502, 0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006),
direct_count = c(1, 1, 0, 0, 0),
secondary_count = c(62, 6, 1, 1, 2)),
row.names = c(NA, -5L),
class = c("data.table",
"data.frame"))
########## DPLYR
library(dplyr)
dt %>%
  group_by(Group) %>%
  filter((max(Score) - Score) < 0.05) %>%
  slice_max(direct_count, n = 1) %>%
  slice_max(secondary_count, n = 1) %>%
  ungroup()
#> # A tibble: 1 x 5
#> Group Gene Score direct_count secondary_count
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 2 CHST6 0.599 1 6
########## DATATABLE
library(data.table)
dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#> Group Gene Score direct_count secondary_count
#> 1: 2 CHST6 0.5989186 1 6
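As for why the code in the question picked Gene1 for this group, one likely explanation is that dist() returns one value per pair of genes rather than one per row, so which.max(d) is a position in that pairwise vector and not a row number. The base R sketch below reproduces only that first branch on the second example's scores; it is a diagnosis under that assumption, not a re-run of the full original code.

```r
# Scores of the second example, in row order:
# CFDP1, CHST6, RNU6-758P, Gene1, TMEM170A
gene  <- c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A")
score <- c(0.551740109920502, 0.598918557167053, 0.564491391181946,
           0.567291617393494, 0.616708278656006)

d <- dist(score, method = "manhattan")

length(d)      # 10 pairwise distances for 5 genes, not 5 per-row values
any(d > 0.05)  # TRUE as soon as ANY pair differs by more than 0.05

# Position of the largest pairwise distance (the CFDP1/TMEM170A pair),
# which the original code then used as a row index
idx <- which.max(d)
gene[idx]
#> [1] "Gene1"
```

The largest pairwise distance happens to sit at position 4 of the vector, and row 4 of the group is Gene1. The test any(d > 0.05) also does not express the first rule, since it fires whenever any two genes are far apart, not only when the top gene is far from all others.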