Identify the rows whose value occurs twice in df$columnA, then assign a value in df$columnB

Time: 2018-05-22 18:28:38

Tags: r dplyr bioinformatics tidyr mutate


FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0                   0                   1   
0189     F359_8005441285    0                   0                   2


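I need to add a "dummy fatherID / motherID" that is shared between these sibling pairs, for downstream analysis.

Specifically, I want to identify the samples whose FamilyID occurs exactly twice and assign them a shared motherID / fatherID value, so that the example above ends up looking like this: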


FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0189_mother         0189_father         1   
0189     F359_8005441285    0189_mother         0189_father         2   


pedigree %>% 
  mutate(FamilySize = count(Family_ID))

Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'groups' applied to an object of class "character".


2 Answers:

Answer 0: (score: 1)


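pedigree %>%
    group_by(FamilyID) %>%
    # n() is the number of rows in the current FamilyID group
    mutate(FamilySize = n()) %>%
    # Fill in dummy parents only for two-member families with no parent recorded
    mutate(MotherID = if_else(FamilySize == 2 & MotherID == 0,
                              paste0(FamilyID, '_mother'),
                              MotherID),
           FatherID = if_else(FamilySize == 2 & FatherID == 0,
                              paste0(FamilyID, '_father'),
                              FatherID))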

# A tibble: 7 x 6
  FamilyID SampleID           MotherID           FatherID             Sex FamilySize
  <chr>    <chr>              <chr>              <chr>              <int>      <int>
1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1          3
2 F1961    F1961-2_8005116603 0                  0                      2          3
3 F1961    F1961-3_8005116421 0                  0                      1          3
4 0450     F350_8005441283    0                  0                      1          1
5 0006     F355_8005441353    0                  0                      1          1
6 0189     F359_8005441284    0189_mother        0189_father            1          2
7 0189     F359_8005441285    0189_mother        0189_father            2          2
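
The `count()` call in the question fails because `count()` expects a data frame rather than a column vector. As a minimal sketch of an alternative, assuming dplyr 0.8.0 or later, `add_count()` appends the group size as a column without an explicit `group_by()`. The `pedigree` data frame below is rebuilt from the sample in the question, and the `result` variable is only an illustrative name.

```r
library(dplyr)

# Sample pedigree rebuilt from the question; IDs are kept as character
pedigree <- data.frame(
  FamilyID = c("F1961", "F1961", "F1961", "0450", "0006", "0189", "0189"),
  SampleID = c("F1961-1_8005116592", "F1961-2_8005116603", "F1961-3_8005116421",
               "F350_8005441283", "F355_8005441353",
               "F359_8005441284", "F359_8005441285"),
  MotherID = c("F1961-3_8005116421", "0", "0", "0", "0", "0", "0"),
  FatherID = c("F1961-2_8005116603", "0", "0", "0", "0", "0", "0"),
  Sex = c(1L, 2L, 1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

result <- pedigree %>%
  # add_count() adds the number of rows per FamilyID as a new column
  add_count(FamilyID, name = "FamilySize") %>%
  mutate(
    # Only two-member families with no recorded parent get a dummy ID
    MotherID = if_else(FamilySize == 2 & MotherID == "0",
                       paste0(FamilyID, "_mother"), MotherID),
    FatherID = if_else(FamilySize == 2 & FatherID == "0",
                       paste0(FamilyID, "_father"), FatherID)
  )

print(result)
```

The result is ungrouped, which may be convenient if further steps should not operate per family.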

Answer 1: (score: 1)

You can use dplyr to group by FamilyID and update the columns (MotherID / FatherID) under the condition n() == 2.

Option #1: Get the result in the OP's expected format

df %>% group_by(FamilyID) %>%
  # if/else is evaluated once per group, so families whose size is not 2
  # keep their original values row by row (ifelse() with the length-1 test
  # n() == 2 would recycle the first row's value across the whole group)
  mutate(MotherID = if (n() == 2) paste(FamilyID, "mother", sep = "_") else MotherID) %>%
  mutate(FatherID = if (n() == 2) paste(FamilyID, "father", sep = "_") else FatherID)

# FamilyID SampleID           MotherID           FatherID             Sex
# <chr>    <chr>              <chr>              <chr>              <int>
# 1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1
# 2 F1961    F1961-2_8005116603 0                  0                      2
# 3 F1961    F1961-3_8005116421 0                  0                      1
# 4 0450     F350_8005441283    0                  0                      1
# 5 0006     F355_8005441353    0                  0                      1
# 6 0189     F359_8005441284    0189_mother        0189_father            1
# 7 0189     F359_8005441285    0189_mother        0189_father            2



Option #2: Update both columns in one step with mutate_at, using a shared dummy value

df %>% group_by(FamilyID) %>%
  # funs() is deprecated as of dplyr 0.8.0; it still runs there with a warning
  mutate_at(vars(MotherID, FatherID),
            funs(if (n() == 2) paste(FamilyID, "dummy", sep = "_") else .))

# # A tibble: 7 x 5
# # Groups: FamilyID [4]
# FamilyID SampleID           MotherID           FatherID             Sex
# <chr>    <chr>              <chr>              <chr>              <int>
# 1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1
# 2 F1961    F1961-2_8005116603 0                  0                      2
# 3 F1961    F1961-3_8005116421 0                  0                      1
# 4 0450     F350_8005441283    0                  0                      1
# 5 0006     F355_8005441353    0                  0                      1
# 6 0189     F359_8005441284    0189_dummy         0189_dummy             1
# 7 0189     F359_8005441285    0189_dummy         0189_dummy             2


Data:

df <- read.table(text = 
"FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0                   0                   1   
0189     F359_8005441285    0                   0                   2",
header = TRUE, stringsAsFactors = FALSE)
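
On newer dplyr releases, the `funs()`-based snippet above may emit deprecation warnings. A minimal sketch of the same idea with `across()`, assuming dplyr 1.0.0 or later, could look like the following. The `df` data frame is rebuilt inline so the block runs on its own, and `result` is only an illustrative name.

```r
library(dplyr)

# Same sample data as above, rebuilt inline so the block is self-contained
df <- data.frame(
  FamilyID = c("F1961", "F1961", "F1961", "0450", "0006", "0189", "0189"),
  SampleID = c("F1961-1_8005116592", "F1961-2_8005116603", "F1961-3_8005116421",
               "F350_8005441283", "F355_8005441353",
               "F359_8005441284", "F359_8005441285"),
  MotherID = c("F1961-3_8005116421", "0", "0", "0", "0", "0", "0"),
  FatherID = c("F1961-2_8005116603", "0", "0", "0", "0", "0", "0"),
  Sex = c(1L, 2L, 1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(FamilyID) %>%
  # The lambda runs once per group and per column: two-member families get
  # a shared dummy ID, all other families keep their existing values (.x)
  mutate(across(c(MotherID, FatherID),
                ~ if (n() == 2) paste(FamilyID, "dummy", sep = "_") else .x)) %>%
  ungroup()

print(result)
```

Using `if`/`else` rather than `ifelse()` here is a deliberate choice: the test `n() == 2` has length one per group, so a scalar branch returns either the full replacement vector or the untouched column.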