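Replace values of character columns using another column with pattern detection

Time: 2018-05-01 20:29:39

Tags: r grep dplyr bioinformatics stringr

I have a data frame, pedigrees, with samples arranged in families: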

pedigrees %>%
  filter(Family_ID %in% sample(pedigrees$Family_ID, 5))


   Family_ID   Sample_ID                      fatherID       motherID         sex status
   <chr>       <chr>                          <chr>          <chr>          <int>  <int>
 1 MtS.MIPS.61 UCSF_AGG0092_8005439845        0              0                  2      0
 2 MtS.MIPS.61 UCSF_AGG0093_8005439857        0              0                  1      0
 3 MtS.MIPS.61 UCSF_AGG0094_8005439869        AGG0093        AGG0092            2      0
 4 MtS.MIPS.61 UCSF_AGG0095_8005439881        AGG0093        AGG0092            2      2
 5 MtS.MIPS.61 UCSF_AGG0091_8005439928        AGG0093        AGG0092            1      2
 6 FAM048      UCSF_G01-GEA-259-HI_8005440194 G01-GEA-259-PA G01-GEA-259-MA     1      2
 7 FAM048      UCSF_G01-GEA-259-MA_8005440206 0              0                  2      0
 8 FAM048      UCSF_G01-GEA-259-PA_8005440218 0              0                  1      0
 9 F1543       UCSF_F1543-1_8005116638        F1543-3        F1543-2            2      2
10 F1543       UCSF_F1543-2_8005116649        0              0                  2      0
11 F1543       UCSF_F1543-3_8005116661        0              0                  1      0
12 AU0045      UCSF_AU0045201_04C32032A       0              0                  1      0
13 AU0045      UCSF_AU0045202_04C32033A       0              0                  2      0
14 AU0045      UCSF_AU0045301_04C32034A       AU0045201      AU0045202          2      2
15 AU0045      UCSF_AU0045302_04C32035A       AU0045201      AU0045202          1      2
16 1232        UCSF_1232002_8004805191        1232011        1232012            2      2
17 1232        UCSF_1232011_8004805203        0              0                  1      1
18 1232        UCSF_1232012_8004805215        0              0                  2      1

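The format of Sample_ID is the format that the fatherID and motherID columns should also have. For example, the last family, 1232, should actually look like this: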

16 1232        UCSF_1232002_8004805191        UCSF_1232011_8004805203 UCSF_1232012_8004805215     2      2
17 1232        UCSF_1232011_8004805203        0                       0                           1      1
18 1232        UCSF_1232012_8004805215        0                       0                           2      1

I know I should use str_match or grep, but how do I apply this to all the samples in pedigree?

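1 Answer:

Answer 0: (Score: 2)

If I understand correctly, you can use dplyr's group_by and then, inside mutate, replace fatherID and motherID depending on whether they are equal to "0". I use grepl to find the Sample_ID that matches the current mother/father ID.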
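As a minimal illustration of the grepl step on its own, the sketch below uses the three Sample_ID values of family 1232 from the question. The variable names sample_ids, hit and mother_full are made up for this example, and fixed = TRUE is an assumption added here so that the short ID is treated as a literal string rather than as a regular expression.

```r
# Sample IDs of one family from the question (family 1232)
sample_ids <- c("UCSF_1232002_8004805191",
                "UCSF_1232011_8004805203",
                "UCSF_1232012_8004805215")

# grepl() returns one TRUE/FALSE per Sample_ID: does it contain the short ID?
# fixed = TRUE (an assumption of this sketch) disables regex interpretation.
hit <- grepl("1232012", sample_ids, fixed = TRUE)

# Logical indexing then pulls out the full-length ID of the mother
mother_full <- sample_ids[hit]
print(mother_full)
```

Inside a grouped mutate, the same lookup runs once per family, so a short ID is only compared against the Sample_ID values of its own family.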

library(dplyr)

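pedigree %>%
  group_by(Family_ID) %>%   # look up parents within each family only
  mutate(
    # Take the family's first non-"0" motherID, find the Sample_ID containing it,
    # and assign that full ID to every row whose motherID is not "0"
    motherID = ifelse(motherID != "0",
                      Sample_ID[grepl(motherID[motherID != "0"][1], Sample_ID)],
                      "0"),
    # Same lookup for the father
    fatherID = ifelse(fatherID != "0",
                      Sample_ID[grepl(fatherID[fatherID != "0"][1], Sample_ID)],
                      "0")
  )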

# A tibble: 18 x 7
# Groups: Family_ID [5]
#       r Family_ID   Sample_ID                      fatherID                       motherID                    sex status
#   <int> <fct>       <chr>                          <chr>                          <chr>                     <int>  <int>
# 1     1 MtS.MIPS.61 UCSF_AGG0092_8005439845        0                              0                             2      0
# 2     2 MtS.MIPS.61 UCSF_AGG0093_8005439857        0                              0                             1      0
# 3     3 MtS.MIPS.61 UCSF_AGG0094_8005439869        UCSF_AGG0093_8005439857        UCSF_AGG0092_8005439845       2      0
# 4     4 MtS.MIPS.61 UCSF_AGG0095_8005439881        UCSF_AGG0093_8005439857        UCSF_AGG0092_8005439845       2      2
# 5     5 MtS.MIPS.61 UCSF_AGG0091_8005439928        UCSF_AGG0093_8005439857        UCSF_AGG0092_8005439845       1      2
# 6     6 FAM048      UCSF_G01-GEA-259-HI_8005440194 UCSF_G01-GEA-259-PA_8005440218 UCSF_G01-GEA-259-MA_8005~     1      2
# 7     7 FAM048      UCSF_G01-GEA-259-MA_8005440206 0                              0                             2      0
# 8     8 FAM048      UCSF_G01-GEA-259-PA_8005440218 0                              0                             1      0
# 9     9 F1543       UCSF_F1543-1_8005116638        UCSF_F1543-3_8005116661        UCSF_F1543-2_8005116649       2      2
#10    10 F1543       UCSF_F1543-2_8005116649        0                              0                             2      0
#11    11 F1543       UCSF_F1543-3_8005116661        0                              0                             1      0
#12    12 AU0045      UCSF_AU0045201_04C32032A       0                              0                             1      0
#13    13 AU0045      UCSF_AU0045202_04C32033A       0                              0                             2      0
#14    14 AU0045      UCSF_AU0045301_04C32034A       UCSF_AU0045201_04C32032A       UCSF_AU0045202_04C32033A      2      2
#15    15 AU0045      UCSF_AU0045302_04C32035A       UCSF_AU0045201_04C32032A       UCSF_AU0045202_04C32033A      1      2
#16    16 1232        UCSF_1232002_8004805191        UCSF_1232011_8004805203        UCSF_1232012_8004805215       2      2
#17    17 1232        UCSF_1232011_8004805203        0                              0                             1      1
#18    18 1232        UCSF_1232012_8004805215        0                              0                             2      1
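The code above looks up only the first non-"0" parent ID in each family, so it relies on all children of a family sharing the same father and mother, which holds for the five families shown. For pedigrees where that may not hold, a per-row exact lookup is one possible alternative. The following is only a sketch under stated assumptions: the helper name expand_parent is hypothetical, the toy data frame reuses two families from the question, and it assumes every Sample_ID has the shape prefix_ID_suffix, with the short ID sitting between the first and the last underscore.

```r
library(dplyr)

# Toy data: families 1232 and F1543 from the question (sex/status omitted)
pedigree <- data.frame(
  Family_ID = c("1232", "1232", "1232", "F1543", "F1543", "F1543"),
  Sample_ID = c("UCSF_1232002_8004805191", "UCSF_1232011_8004805203",
                "UCSF_1232012_8004805215", "UCSF_F1543-1_8005116638",
                "UCSF_F1543-2_8005116649", "UCSF_F1543-3_8005116661"),
  fatherID  = c("1232011", "0", "0", "F1543-3", "0", "0"),
  motherID  = c("1232012", "0", "0", "F1543-2", "0", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map each short parent ID to the full Sample_ID
# by exact match on the middle part of Sample_ID.
expand_parent <- function(parent, sample_id) {
  # Assumed layout: <prefix>_<short ID>_<suffix>; keep only the short ID
  core <- sub("^[^_]+_(.*)_[^_]+$", "\\1", sample_id)
  full <- sample_id[match(parent, core)]
  # Keep "0" (founders) and any ID without a match unchanged
  ifelse(parent == "0" | is.na(full), parent, full)
}

result <- pedigree %>%
  group_by(Family_ID) %>%   # match only against samples of the same family
  mutate(fatherID = expand_parent(fatherID, Sample_ID),
         motherID = expand_parent(motherID, Sample_ID)) %>%
  ungroup()

print(result)
```

Because match() compares whole strings, a short ID that happens to be a substring of another sample's ID is not picked up by mistake, and each row is resolved from its own fatherID and motherID rather than from the first one in the family.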