Mutate a new boolean column by comparing two columns

Time: 2019-01-17 06:19:27

Tags: r

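I want to compare two columns containing genotypes and generate a new boolean column. There are some special cases, however: GG can also equal CC, and AA can also equal TT, and vice versa.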

df: 
rsid    ref sample
rs104211    CC  GG
rs104998    AA  TT
rs105063    TT  AA
rs105076    AA  AA
rs105078    TT  GG
rs105090    AA  GG
rs105162    AC  AC
rs105304    AA  TT
rs105338    AA  GG
rs105490    GG  CC
rs105491    AA  AA
rs105492    AG  AG
rs105705    AC  AC
rs105975    AA  GG
rs106213    AA  AA
rs106396    GG  CC

Desired output:

rsid    ref sample  boolean
rs104211    CC  GG  TRUE
rs104998    AA  TT  TRUE
rs105063    TT  AA  TRUE
rs105076    AA  AA  TRUE
rs105078    TT  GG  FALSE
rs105090    AA  GG  FALSE
rs105162    AC  AC  TRUE
rs105304    AA  TT  TRUE
rs105338    AA  GG  FALSE
rs105490    GG  CC  TRUE
rs105491    AA  AA  TRUE
rs105492    AG  AG  TRUE
rs105705    AC  AC  TRUE
rs105975    AA  GG  FALSE
rs106213    AA  AA  TRUE
rs106396    GG  CC  TRUE

code:
match.boolean <- function(x) {
df <- if (x=="CC" | x=="GG" ) {
print("TRUE") 
} else if (x=="AA" | x=="TT") {
print("TRUE")
} else if (x=="AC" | x=="AG") {
print("TRUE")
} else {
print("FALSE")
}
return(df)
}

df$boolean <- lapply(df,function(x) match.boolean(df[,2]==df[,3]))

But this is wrong.

2 answers:

Answer 0 (score: 3)

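Try this (at least, this is the logical expression I think applies, given that some of the possibilities are not stated in your question):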

df$boolean <- with(df, ref == sample |
                         (ref %in% c("CC", "GG") & sample %in% c("GG", "CC")) |
                         (ref %in% c("TT", "AA") & sample %in% c("TT", "AA")))
> df
       rsid ref sample boolean
1  rs104211  CC     GG    TRUE
2  rs104998  AA     TT    TRUE
3  rs105063  TT     AA    TRUE
4  rs105076  AA     AA    TRUE
5  rs105078  TT     GG   FALSE
6  rs105090  AA     GG   FALSE
7  rs105162  AC     AC    TRUE
8  rs105304  AA     TT    TRUE
9  rs105338  AA     GG   FALSE
10 rs105490  GG     CC    TRUE
11 rs105491  AA     AA    TRUE
12 rs105492  AG     AG    TRUE
13 rs105705  AC     AC    TRUE
14 rs105975  AA     GG   FALSE
15 rs106213  AA     AA    TRUE
16 rs106396  GG     CC    TRUE
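
For anyone who wants to check the expression against the question's desired output, here is a minimal, self-contained sketch. The data.frame() construction and the name expected are assumptions made for illustration; the question only shows the printed table.

```r
# Rebuild the example data from the question (this construction is assumed;
# the question only shows the printed table).
df <- data.frame(
  rsid   = c("rs104211", "rs104998", "rs105063", "rs105076", "rs105078",
             "rs105090", "rs105162", "rs105304", "rs105338", "rs105490",
             "rs105491", "rs105492", "rs105705", "rs105975", "rs106213",
             "rs106396"),
  ref    = c("CC", "AA", "TT", "AA", "TT", "AA", "AC", "AA", "AA", "GG",
             "AA", "AG", "AC", "AA", "AA", "GG"),
  sample = c("GG", "TT", "AA", "AA", "GG", "GG", "AC", "TT", "GG", "CC",
             "AA", "AG", "AC", "GG", "AA", "CC"),
  stringsAsFactors = FALSE
)

# Vectorized comparison: identical genotypes, or both in the same
# equivalence pair (CC/GG or AA/TT).
df$boolean <- with(df, ref == sample |
                     (ref %in% c("CC", "GG") & sample %in% c("CC", "GG")) |
                     (ref %in% c("AA", "TT") & sample %in% c("AA", "TT")))

# The "boolean" column from the question's desired output, row by row.
expected <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE,
              FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Stops with an error if the result differs from the desired output.
stopifnot(identical(df$boolean, expected))
```

Because every operator here is vectorized, no if/else or lapply is needed; the whole column is computed in one pass.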

Answer 1 (score: 1)

We can create a named list, comparison_list, containing all the possible values, and then use mapply:

comparison_list <- list(GGCC = c("GG", "CC"), AATT = c("AA", "TT"),
                        ACAG = c("AC", "AG"))


df$boolean <- mapply(function(x, y) 
              any(comparison_list[[grep(x, names(comparison_list))]] == 
                  comparison_list[[grep(y, names(comparison_list))]]), 
              df$ref, df$sample)

df
#       rsid ref sample boolean
#1  rs104211  CC     GG    TRUE
#2  rs104998  AA     TT    TRUE
#3  rs105063  TT     AA    TRUE
#4  rs105076  AA     AA    TRUE
#5  rs105078  TT     GG   FALSE
#6  rs105090  AA     GG   FALSE
#7  rs105162  AC     AC    TRUE
#8  rs105304  AA     TT    TRUE
#9  rs105338  AA     GG   FALSE
#10 rs105490  GG     CC    TRUE
#11 rs105491  AA     AA    TRUE
#12 rs105492  AG     AG    TRUE
#13 rs105705  AC     AC    TRUE
#14 rs105975  AA     GG   FALSE
#15 rs106213  AA     AA    TRUE
#16 rs106396  GG     CC    TRUE
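
One caveat: grep() does pattern matching against the list names, so a value not in the example data, such as "GC", would also match the name "GGCC". A minimal sketch of an exact-match lookup follows; the helpers group_of() and same_group() are hypothetical names introduced here, not part of the approach above.

```r
comparison_list <- list(GGCC = c("GG", "CC"), AATT = c("AA", "TT"),
                        ACAG = c("AC", "AG"))

# Hypothetical helper: index of the group that contains genotype g exactly,
# or NA if g belongs to no group.
group_of <- function(g) {
  hit <- which(vapply(comparison_list, function(grp) g %in% grp, logical(1)))
  if (length(hit) == 1) unname(hit) else NA_integer_
}

# Hypothetical helper: TRUE where x and y fall in the same known group.
same_group <- function(x, y) {
  gx <- vapply(x, group_of, integer(1), USE.NAMES = FALSE)
  gy <- vapply(y, group_of, integer(1), USE.NAMES = FALSE)
  !is.na(gx) & !is.na(gy) & gx == gy
}

# "GC" is not a member of any group, so the last pair is FALSE
# rather than being matched against the name "GGCC".
same_group(c("CC", "TT", "AC", "GC"), c("GG", "GG", "AG", "GG"))
```

With the question's data this would be used as df$boolean <- same_group(df$ref, df$sample).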

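The suggestion above is intended to keep the list short. You could also create a separate element for each value, which makes the comparison code simpler: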

comparison_list <- list(GG = c("GG", "CC"), CC = c("GG", "CC"), 
                        AA = c("AA", "TT"), TT = c("AA", "TT"), 
                        AC = c("AC", "AG"), AG = c("AC", "AG"))

df$boolean <- mapply(function(x, y) any(comparison_list[[x]]==comparison_list[[y]]), 
                df$ref, df$sample)
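
The pairs GG/CC and AA/TT happen to be complements of each other under the base pairing A<->T, C<->G. If that is the intended rule (an assumption; the question does not say so), the lookup list can be replaced by chartr(). The function name complement_match() and the sample vectors below are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: TRUE when the sample genotype equals the reference
# or its base-wise complement (A<->T, C<->G). Assumes the complement rule
# is what the question means, which is not stated there.
complement_match <- function(ref, sample) {
  ref    <- as.character(ref)     # also accepts factor columns
  sample <- as.character(sample)
  ref == sample | chartr("ACGT", "TGCA", ref) == sample
}

ref    <- c("CC", "AA", "TT", "AC", "AG", "AC")
sample <- c("GG", "TT", "GG", "AC", "AG", "TG")

# The third pair (TT vs GG) is the only one that is neither identical
# nor a complement.
complement_match(ref, sample)
```

Note that this reading differs from comparison_list: AC versus AG is FALSE here, whereas the list-based approach treats them as equivalent. For the rows shown in the question, both give the same result.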