Conditional comparison of character values between columns in R

Time: 2014-12-07 03:20:02

Tags: r

I have a data frame:

 mi    chr   gen.pos  phys.pos  sample1  sample2  sample3  sample4
 snp1  Ch09  NA       12712760  CC       CC       CT       TT
 snp3  Ch02  NA       16594215  GG       HH       GG       GG
 snp6  Ch14  NA       34284723  CC       --       CC       TT
 snp7  Ch13  NA       21532194  AA       GG       AA       GG
 snp8  Ch13  NA       21532040  CC       AA       CC       AA
 snp9  Ch11  NA       38423068  TT       CT       CC       CC

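I want to create three more columns that hold the result of comparing sample4 with each of the other three samples. The condition is that a comparison only counts when both values are in the list c("AA","CC","GG","TT","HH"): the result is TRUE when both values are in that list and differ from each other, and FALSE otherwise. So the expected result is: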

 mi    chr   gen.pos  phys.pos  sample1  sample2  sample3  sample4  sample4_sample1  sample4_sample2  sample4_sample3
 snp1  Ch09  NA       12712760  CC       CC       CT       TT       TRUE             TRUE             FALSE
 snp3  Ch02  NA       16594215  GG       HH       GG       GG       FALSE            TRUE             FALSE
 snp6  Ch14  NA       34284723  CC       --       CC       TT       TRUE             FALSE            TRUE
 snp7  Ch13  NA       21532194  AA       GG       AA       GG       TRUE             FALSE            TRUE
 snp8  Ch13  NA       21532040  CC       AA       CC       AA       TRUE             FALSE            TRUE
 snp9  Ch11  NA       38423068  TT       CT       CC       CC       TRUE             FALSE            FALSE

Thanks for your help.

1 answer:

Answer 0 (score: 1)

You could try the following (df is the data frame defined under "Data" at the end):

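# Genotype calls that are eligible for comparison
Un <-  c("AA","CC","GG","TT","HH")
# Names for the new columns: sample4_sample1, sample4_sample2, sample4_sample3
newCols <- paste(colnames(df)[8], colnames(df)[5:7], sep="_")
# For each of sample1..sample3 (x) against sample4 (y): TRUE when the two
# values differ and both of them belong to Un (z)
df[newCols] <-  Map(function(x,y,z) x!=y &
                    apply(cbind(x,y), 1, FUN=function(.x) all(.x %in% z)),
                      df[paste0('sample', 1:3)],  df['sample4'], list(Un))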

df
#    mi  chr gen.pos phys.pos sample1 sample2 sample3 sample4 sample4_sample1
#1 snp1 Ch09      NA 12712760      CC      CC      CT      TT            TRUE
#2 snp3 Ch02      NA 16594215      GG      HH      GG      GG           FALSE
#3 snp6 Ch14      NA 34284723      CC      --      CC      TT            TRUE
#4 snp7 Ch13      NA 21532194      AA      GG      AA      GG            TRUE
#5 snp8 Ch13      NA 21532040      CC      AA      CC      AA            TRUE
#6 snp9 Ch11      NA 38423068      TT      CT      CC      CC            TRUE
#  sample4_sample2 sample4_sample3
#1            TRUE           FALSE
#2            TRUE           FALSE
#3           FALSE            TRUE
#4           FALSE            TRUE
#5           FALSE            TRUE
#6           FALSE           FALSE
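
The row-wise apply() above can probably be avoided, since %in% is already vectorised. The following is only a sketch of that idea, not part of the original answer: differs_within is a hypothetical helper name, and the data frame is a cut-down stand-in holding just the four sample columns from the question. Because FALSE & NA evaluates to FALSE in R, a missing call also yields FALSE here rather than NA.

```r
# Stand-in for the question's data frame (sample columns only).
df <- data.frame(
  sample1 = c("CC", "GG", "CC", "AA", "CC", "TT"),
  sample2 = c("CC", "HH", "--", "GG", "AA", "CT"),
  sample3 = c("CT", "GG", "CC", "AA", "CC", "CC"),
  sample4 = c("TT", "GG", "TT", "GG", "AA", "CC"),
  stringsAsFactors = FALSE
)

# Genotype calls that are eligible for comparison
Un <- c("AA", "CC", "GG", "TT", "HH")

# Hypothetical helper: TRUE only when both values are in `allowed` and differ.
differs_within <- function(x, ref, allowed) {
  x %in% allowed & ref %in% allowed & x != ref
}

others  <- paste0("sample", 1:3)
newCols <- paste("sample4", others, sep = "_")

# Compare every other sample column against sample4 in one pass.
df[newCols] <- lapply(df[others], differs_within,
                      ref = df$sample4, allowed = Un)

df[newCols]
```

On the six rows from the question this should give the same three logical columns as the Map() version.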

Data

df <- structure(list(mi = c("snp1", "snp3", "snp6", "snp7", "snp8", 
"snp9"), chr = c("Ch09", "Ch02", "Ch14", "Ch13", "Ch13", "Ch11"
), gen.pos = c(NA, NA, NA, NA, NA, NA), phys.pos = c(12712760L, 
16594215L, 34284723L, 21532194L, 21532040L, 38423068L), sample1 = c("CC", 
"GG", "CC", "AA", "CC", "TT"), sample2 = c("CC", "HH", "--", 
"GG", "AA", "CT"), sample3 = c("CT", "GG", "CC", "AA", "CC", 
"CC"), sample4 = c("TT", "GG", "TT", "GG", "AA", "CC")), .Names = c("mi", 
"chr", "gen.pos", "phys.pos", "sample1", "sample2", "sample3", 
"sample4"), class = "data.frame", row.names = c(NA, -6L))