I have a data frame:
mi chr gen.pos phys.pos sample1 sample2 sample3 sample4
snp1 Ch09 NA 12712760 CC CC CT TT
snp3 Ch02 NA 16594215 GG HH GG GG
snp6 Ch14 NA 34284723 CC -- CC TT
snp7 Ch13 NA 21532194 AA GG AA GG
snp8 Ch13 NA 21532040 CC AA CC AA
snp9 Ch11 NA 38423068 TT CT CC CC
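I want to create three more columns holding the results of comparing sample4 with each of the other three samples. The rule is: a comparison counts only when both values are in the list c("AA","CC","GG","TT","HH"); it returns TRUE when those two values differ and FALSE otherwise. So the expected result is: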
mi chr gen.pos phys.pos sample1 sample2 sample3 sample4 sample4_sample1 sample4_sample2 sample4_sample3
snp1 Ch09 NA 12712760 CC CC CT TT TRUE TRUE FALSE
snp3 Ch02 NA 16594215 GG HH GG GG FALSE TRUE FALSE
snp6 Ch14 NA 34284723 CC -- CC TT TRUE FALSE TRUE
snp7 Ch13 NA 21532194 AA GG AA GG TRUE FALSE TRUE
snp8 Ch13 NA 21532040 CC AA CC AA TRUE FALSE TRUE
snp9 Ch11 NA 38423068 TT CT CC CC TRUE FALSE FALSE
Thank you for your help.
Answer 0 (score: 1)
You could try:
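# Genotype calls that are allowed to take part in a comparison
Un <- c("AA","CC","GG","TT","HH")
# New column names: sample4_sample1, sample4_sample2, sample4_sample3
newCols <- paste(colnames(df)[8], colnames(df)[5:7], sep="_")
# TRUE only where the two calls differ and both of them are in Un
df[newCols] <- Map(function(x,y,z) x!=y &
    apply(cbind(x,y), 1, FUN=function(.x) all(.x %in% z)),
    df[paste0('sample', 1:3)], df['sample4'], list(Un))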
df
# mi chr gen.pos phys.pos sample1 sample2 sample3 sample4 sample4_sample1
#1 snp1 Ch09 NA 12712760 CC CC CT TT TRUE
#2 snp3 Ch02 NA 16594215 GG HH GG GG FALSE
#3 snp6 Ch14 NA 34284723 CC -- CC TT TRUE
#4 snp7 Ch13 NA 21532194 AA GG AA GG TRUE
#5 snp8 Ch13 NA 21532040 CC AA CC AA TRUE
#6 snp9 Ch11 NA 38423068 TT CT CC CC TRUE
# sample4_sample2 sample4_sample3
#1 TRUE FALSE
#2 TRUE FALSE
#3 FALSE TRUE
#4 FALSE TRUE
#5 FALSE TRUE
#6 FALSE FALSE
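The function passed to Map above combines two tests per row: the two calls differ, and both calls belong to Un. The second test does not strictly need a row-wise apply, because %in% is already vectorised. The following is a minimal sketch of that variant, not the original answer's code. It rebuilds only the four sample columns from the question, and the names ref and others are introduced here for illustration.

```r
# Allowed genotype calls, as in the answer above
Un <- c("AA", "CC", "GG", "TT", "HH")

# Only the sample columns from the question (the other columns are not needed here)
df <- data.frame(
  sample1 = c("CC", "GG", "CC", "AA", "CC", "TT"),
  sample2 = c("CC", "HH", "--", "GG", "AA", "CT"),
  sample3 = c("CT", "GG", "CC", "AA", "CC", "CC"),
  sample4 = c("TT", "GG", "TT", "GG", "AA", "CC"),
  stringsAsFactors = FALSE
)

others  <- paste0("sample", 1:3)               # columns compared against sample4
ref     <- df$sample4                          # reference column
newCols <- paste("sample4", others, sep = "_") # sample4_sample1, ...

# For each column: calls differ AND both calls are in the allowed set
df[newCols] <- lapply(df[others], function(x) {
  x != ref & x %in% Un & ref %in% Un
})

print(df[newCols])
```

On the question's data this yields the same three columns as the Map version. A missing call (NA) is not in Un, so such rows come out as FALSE in both versions.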
# data used above
df <- structure(list(mi = c("snp1", "snp3", "snp6", "snp7", "snp8",
"snp9"), chr = c("Ch09", "Ch02", "Ch14", "Ch13", "Ch13", "Ch11"
), gen.pos = c(NA, NA, NA, NA, NA, NA), phys.pos = c(12712760L,
16594215L, 34284723L, 21532194L, 21532040L, 38423068L), sample1 = c("CC",
"GG", "CC", "AA", "CC", "TT"), sample2 = c("CC", "HH", "--",
"GG", "AA", "CT"), sample3 = c("CT", "GG", "CC", "AA", "CC",
"CC"), sample4 = c("TT", "GG", "TT", "GG", "AA", "CC")), .Names = c("mi",
"chr", "gen.pos", "phys.pos", "sample1", "sample2", "sample3",
"sample4"), class = "data.frame", row.names = c(NA, -6L))