Identify the rows whose value in df$columnA occurs twice, then assign a value in df$columnB

Asked: 2018-05-22 18:28:38

Tags: r dplyr bioinformatics tidyr mutate

I have a data frame, pedigree, that looks like this:

FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0                   0                   1   
0189     F359_8005441285    0                   0                   2
.
.
.

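Some FamilyIDs (e.g. 0189) occur twice. These correspond to sibling pairs whose parents' information was not recorded.

For downstream analysis I need to add a dummy FatherID / MotherID that is shared between the two siblings of each pair.

Specifically, I want to identify the samples whose FamilyID occurs twice and assign them a shared MotherID / FatherID value, so that the example above looks like this:

Expected output: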

FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0189_mother         0189_father         1   
0189     F359_8005441285    0189_mother         0189_father         2   
.
.
.

So far I have tried starting with mutate to add a column that records how many times each FamilyID is observed, but this does not work:

pedigree %>% 
  mutate(FamilySize = count(Family_ID))

Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'groups' applied to an object of class "character".

Thanks very much for your help.

2 Answers:

Answer 0: (score: 1)

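To compute family size, we group by FamilyID and then use n() to count the rows in each group. We can then use mutate together with if_else to replace the values of MotherID and FatherID as needed. In practice we can leave the table grouped by FamilyID for the second mutate: FamilySize is constant within each group, and if_else is vectorised, so the test on MotherID and FatherID is applied to each row individually. A condition on another column (for example Sex) could be added to the if_else test in the same way. Switching to rowwise would only be needed if the function being applied were not vectorised and had to be called one row at a time.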

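library(dplyr)

pedigree %>%
    group_by(FamilyID) %>%
    # number of samples recorded for each family
    mutate(FamilySize = n()) %>%
    # fill in dummy parents only for pairs whose parent is unrecorded ("0")
    mutate(MotherID = if_else(FamilySize == 2 & MotherID == "0",
                              paste0(FamilyID, '_mother'),
                              MotherID),
           FatherID = if_else(FamilySize == 2 & FatherID == "0",
                              paste0(FamilyID, '_father'),
                              FatherID))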

# A tibble: 7 x 6
  FamilyID SampleID           MotherID           FatherID             Sex FamilySize
  <chr>    <chr>              <chr>              <chr>              <int>      <int>
1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1          3
2 F1961    F1961-2_8005116603 0                  0                      2          3
3 F1961    F1961-3_8005116421 0                  0                      1          3
4 0450     F350_8005441283    0                  0                      1          1
5 0006     F355_8005441353    0                  0                      1          1
6 0189     F359_8005441284    0189_mother        0189_father            1          2
7 0189     F359_8005441285    0189_mother        0189_father            2          2
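
As a side note on the error in the question: count() expects a data frame as its first argument, so it cannot be called on a bare column inside mutate. A minimal sketch of a variant that avoids group_by altogether, assuming a dplyr version that provides add_count() with the name argument. The intermediate FamilySize column and the result variable are only illustrative names:

```r
library(dplyr)

# Example data copied from the question.
pedigree <- read.table(text =
"FamilyID SampleID           MotherID            FatherID            Sex
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2
F1961    F1961-3_8005116421 0                   0                   1
0450     F350_8005441283    0                   0                   1
0006     F355_8005441353    0                   0                   1
0189     F359_8005441284    0                   0                   1
0189     F359_8005441285    0                   0                   2",
header = TRUE, stringsAsFactors = FALSE)

result <- pedigree %>%
  # add_count() appends the per-FamilyID row count and keeps the row order
  add_count(FamilyID, name = "FamilySize") %>%
  mutate(
    MotherID = if_else(FamilySize == 2 & MotherID == "0",
                       paste0(FamilyID, "_mother"), MotherID),
    FatherID = if_else(FamilySize == 2 & FatherID == "0",
                       paste0(FamilyID, "_father"), FatherID)
  ) %>%
  # drop the helper column so the result has the original five columns
  select(-FamilySize)

result$MotherID[6:7]  # "0189_mother" "0189_mother"
```

Because add_count() leaves the data ungrouped, there is no grouping to remove afterwards.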

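Answer 1: (score: 1)

You can use dplyr to group by FamilyID and update the columns (MotherID / FatherID) when the condition n() == 2 holds.

Option #1: get the result in the OP's expected format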

library(dplyr)

# n() is a single number per group, so a scalar if/else is used here.
# ifelse() with a length-one test returns only the first element of
# MotherID / FatherID, which would overwrite the rest of a larger family.
df %>% group_by(FamilyID) %>%
  mutate(MotherID = if (n() == 2) paste(FamilyID, "mother", sep = "_") else MotherID) %>%
  mutate(FatherID = if (n() == 2) paste(FamilyID, "father", sep = "_") else FatherID)

# FamilyID SampleID           MotherID           FatherID             Sex
# <chr>    <chr>              <chr>              <chr>              <int>
# 1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1
# 2 F1961    F1961-2_8005116603 0                  0                      2
# 3 F1961    F1961-3_8005116421 0                  0                      1
# 4 0450     F350_8005441283    0                  0                      1
# 5 0006     F355_8005441353    0                  0                      1
# 6 0189     F359_8005441284    0189_mother        0189_father            1
# 7 0189     F359_8005441285    0189_mother        0189_father            2

Option #2: if the OP is happy with dummy IDs of the form FamilyID_dummy, a more elegant solution is possible with mutate_at:

library(dplyr)

# funs() is deprecated in current dplyr, so a formula lambda (~) is used instead.
# The scalar if/else leaves every row of a group unchanged unless n() == 2.
df %>% group_by(FamilyID) %>%
  mutate_at(vars(MotherID, FatherID),
            ~ if (n() == 2) paste(FamilyID, "dummy", sep = "_") else .)

# # A tibble: 7 x 5
# # Groups: FamilyID [4]
# FamilyID SampleID           MotherID           FatherID             Sex
# <chr>    <chr>              <chr>              <chr>              <int>
# 1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1
# 2 F1961    F1961-2_8005116603 0                  0                      2
# 3 F1961    F1961-3_8005116421 0                  0                      1
# 4 0450     F350_8005441283    0                  0                      1
# 5 0006     F355_8005441353    0                  0                      1
# 6 0189     F359_8005441284    0189_dummy         0189_dummy             1
# 7 0189     F359_8005441285    0189_dummy         0189_dummy             2
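
For comparison, the same idea can be sketched in base R without dplyr. This is only an illustration on a shortened copy of the example data. The names family_size and needs_dummy are hypothetical, and the extra requirement that both parents are "0" is an assumption taken from the question's description of unrecorded parents:

```r
# Shortened copy of the example data, built directly as character columns.
df <- data.frame(
  FamilyID = c("F1961", "F1961", "F1961", "0450", "0189", "0189"),
  SampleID = c("F1961-1_8005116592", "F1961-2_8005116603", "F1961-3_8005116421",
               "F350_8005441283", "F359_8005441284", "F359_8005441285"),
  MotherID = c("F1961-3_8005116421", "0", "0", "0", "0", "0"),
  FatherID = c("F1961-2_8005116603", "0", "0", "0", "0", "0"),
  Sex      = c(1L, 2L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# ave() applies length() within each FamilyID and returns one value per row.
family_size <- ave(seq_along(df$FamilyID), df$FamilyID, FUN = length)

# Rows that belong to a pair and have no recorded parents (assumption: "0").
needs_dummy <- family_size == 2 & df$MotherID == "0" & df$FatherID == "0"

df$MotherID[needs_dummy] <- paste0(df$FamilyID[needs_dummy], "_mother")
df$FatherID[needs_dummy] <- paste0(df$FamilyID[needs_dummy], "_father")

df[needs_dummy, c("SampleID", "MotherID", "FatherID")]
```

Only the two 0189 rows are modified. The F1961 trio and the singleton 0450 keep their original values.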

Data:

df <- read.table(text = 
"FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0                   0                   1   
0189     F359_8005441285    0                   0                   2",
header = TRUE, stringsAsFactors = FALSE)