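I want to compare two columns of genotypes and generate a new boolean column. However, some different values should count as matches: for example, GG can also equal CC, and AA can also equal TT, and vice versa.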
df:
rsid ref sample
rs104211 CC GG
rs104998 AA TT
rs105063 TT AA
rs105076 AA AA
rs105078 TT GG
rs105090 AA GG
rs105162 AC AC
rs105304 AA TT
rs105338 AA GG
rs105490 GG CC
rs105491 AA AA
rs105492 AG AG
rs105705 AC AC
rs105975 AA GG
rs106213 AA AA
rs106396 GG CC
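For reproducibility, the table above can be loaded roughly as follows. This is a minimal sketch that assumes the columns should be plain character strings rather than factors; the `colClasses` setting is an assumption, not something stated in the question.

```r
# Read the example data from inline text; all columns kept as character
df <- read.table(text = "
rsid ref sample
rs104211 CC GG
rs104998 AA TT
rs105063 TT AA
rs105076 AA AA
rs105078 TT GG
rs105090 AA GG
rs105162 AC AC
rs105304 AA TT
rs105338 AA GG
rs105490 GG CC
rs105491 AA AA
rs105492 AG AG
rs105705 AC AC
rs105975 AA GG
rs106213 AA AA
rs106396 GG CC
", header = TRUE, colClasses = "character")

str(df)
```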
Desired output:
rsid ref sample boolean
rs104211 CC GG TRUE
rs104998 AA TT TRUE
rs105063 TT AA TRUE
rs105076 AA AA TRUE
rs105078 TT GG FALSE
rs105090 AA GG FALSE
rs105162 AC AC TRUE
rs105304 AA TT TRUE
rs105338 AA GG FALSE
rs105490 GG CC TRUE
rs105491 AA AA TRUE
rs105492 AG AG TRUE
rs105705 AC AC TRUE
rs105975 AA GG FALSE
rs106213 AA AA TRUE
rs106396 GG CC TRUE
code:
match.boolean <- function(x) {
  df <- if (x=="CC" | x=="GG" ) {
    print("TRUE")
  } else if (x=="AA" | x=="TT") {
    print("TRUE")
  } else if (x=="AC" | x=="AG") {
    print("TRUE")
  } else {
    print("FALSE")
  }
  return(df)
}
df$boolean <- lapply(df,function(x) match.boolean(df[,2]==df[,3]))
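But this is wrong.
Answer 0 (score: 3)
Try this (at least, this is how I think the logical expression should handle some of the possibilities you left unstated):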
df$boolean <- with(df, ref == sample |
                   (ref %in% c("CC", "GG") & sample %in% c("GG", "CC")) |
                   (ref %in% c("TT", "AA") & sample %in% c("TT", "AA")))
> df
rsid ref sample boolean
1 rs104211 CC GG TRUE
2 rs104998 AA TT TRUE
3 rs105063 TT AA TRUE
4 rs105076 AA AA TRUE
5 rs105078 TT GG FALSE
6 rs105090 AA GG FALSE
7 rs105162 AC AC TRUE
8 rs105304 AA TT TRUE
9 rs105338 AA GG FALSE
10 rs105490 GG CC TRUE
11 rs105491 AA AA TRUE
12 rs105492 AG AG TRUE
13 rs105705 AC AC TRUE
14 rs105975 AA GG FALSE
15 rs106213 AA AA TRUE
16 rs106396 GG CC TRUE
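If the equivalences in the question (GG with CC, AA with TT) are really strand complements, the same check could be written without listing every pair. That interpretation is an assumption, not something the question confirms, and the helper name `complement` is hypothetical. A minimal sketch on a subset of the rows:

```r
# A few rows from the question, stored as character (not factor)
df <- data.frame(
  rsid   = c("rs104211", "rs104998", "rs105078", "rs105162"),
  ref    = c("CC", "AA", "TT", "AC"),
  sample = c("GG", "TT", "GG", "AC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: swap each base for its complement (A <-> T, C <-> G)
complement <- function(x) chartr("ACGT", "TGCA", x)

# TRUE when the genotypes are identical or one is the complement of the other
df$boolean <- df$ref == df$sample | complement(df$ref) == df$sample
df
```

Note that this compares the two letters in order, so a heterozygous pair written in a different order (for example AC against GT) would not match without first sorting the letters.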
Answer 1 (score: 1)
We can create a list called comparison_list that holds all the possible values, and then use mapply:
comparison_list <- list(GGCC = c("GG", "CC"), AATT = c("AA", "TT"),
                        ACAG = c("AC", "AG"))
df$boolean <- mapply(function(x, y)
  any(comparison_list[[grep(x, names(comparison_list))]] ==
      comparison_list[[grep(y, names(comparison_list))]]),
  df$ref, df$sample)
df
# rsid ref sample boolean
#1 rs104211 CC GG TRUE
#2 rs104998 AA TT TRUE
#3 rs105063 TT AA TRUE
#4 rs105076 AA AA TRUE
#5 rs105078 TT GG FALSE
#6 rs105090 AA GG FALSE
#7 rs105162 AC AC TRUE
#8 rs105304 AA TT TRUE
#9 rs105338 AA GG FALSE
#10 rs105490 GG CC TRUE
#11 rs105491 AA AA TRUE
#12 rs105492 AG AG TRUE
#13 rs105705 AC AC TRUE
#14 rs105975 AA GG FALSE
#15 rs106213 AA AA TRUE
#16 rs106396 GG CC TRUE
The approach above keeps the list short. You could also create a separate element for each value, which makes the comparison code simpler:
comparison_list <- list(GG = c("GG", "CC"), CC = c("GG", "CC"),
                        AA = c("AA", "TT"), TT = c("AA", "TT"),
                        AC = c("AC", "AG"), AG = c("AC", "AG"))
df$boolean <- mapply(function(x, y) any(comparison_list[[x]] == comparison_list[[y]]),
                     df$ref, df$sample)
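The per-value idea can also be expressed without mapply, by mapping each genotype to a group label and comparing the labels in one vectorized step. This is a sketch, not part of the original answer; the `group` vector and its labels are hypothetical, and it keeps this answer's grouping, in which AC and AG count as equivalent.

```r
# A few rows from the question, stored as character (not factor)
df <- data.frame(
  rsid   = c("rs104211", "rs104998", "rs105078", "rs105162"),
  ref    = c("CC", "AA", "TT", "AC"),
  sample = c("GG", "TT", "GG", "AC"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: each genotype maps to the label of its equivalence group
group <- c(GG = "GGCC", CC = "GGCC",
           AA = "AATT", TT = "AATT",
           AC = "ACAG", AG = "ACAG")

# Index by name; as.character() guards against factor columns being used as
# integer positions. Genotypes missing from `group` yield NA, not an error.
df$boolean <- unname(group[as.character(df$ref)] == group[as.character(df$sample)])
df
```

Because the lookup is a single indexing operation, it avoids calling a function once per row, which may matter on large SNP tables.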