Question

I have a data frame, pedigree, that looks like this:

FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0                   0                   1   
0189     F359_8005441285    0                   0                   2
.
.
.

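Some FamilyIDs (for example 0189) appear twice. These correspond to sibling pairs whose parents were not recorded.

For downstream analysis I need to add a "dummy fatherID / motherID" that is shared between the two siblings of each pair.

Specifically, I want to identify the samples whose FamilyID appears twice and assign them a shared motherID / fatherID value, so that the example above looks like this:

Desired output: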

FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0189_mother         0189_father         1   
0189     F359_8005441285    0189_mother         0189_father         2   
.
.
.

So far I have tried starting with mutate to add a column that records how many times each FamilyID is observed, but this does not work:

pedigree %>% 
  mutate(FamilySize = count(Family_ID))

Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'groups' applied to an object of class "character".

Any help would be much appreciated.

Answer 1

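To compute the family size, group by FamilyID and count the rows in each group with n(). Then use mutate together with if_else to replace the values of MotherID or FatherID where needed. The table can stay grouped by FamilyID for this step: FamilySize is constant within each group, and if_else is vectorized, so the == 0 test is evaluated for every row of the group. Switching to rowwise would only be necessary if the replacement relied on a function that is not vectorized and had to be applied one row at a time.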

pedigree %>%
    group_by(FamilyID) %>%
    mutate(FamilySize = n()) %>%
    mutate(MotherID = if_else(FamilySize == 2 & MotherID == 0,
                              paste0(FamilyID, '_mother'),
                              MotherID),
           FatherID = if_else(FamilySize == 2 & FatherID == 0,
                              paste0(FamilyID, '_father'),
                              FatherID))

# A tibble: 7 x 6
  FamilyID SampleID           MotherID           FatherID             Sex FamilySize
  <chr>    <chr>              <chr>              <chr>              <int>      <int>
1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1          3
2 F1961    F1961-2_8005116603 0                  0                      2          3
3 F1961    F1961-3_8005116421 0                  0                      1          3
4 0450     F350_8005441283    0                  0                      1          1
5 0006     F355_8005441353    0                  0                      1          1
6 0189     F359_8005441284    0189_mother        0189_father            1          2
7 0189     F359_8005441285    0189_mother        0189_father            2          2
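As for the attempt in the question: judging from the error message, count() expects a data frame as its first argument, so it cannot be applied to a single column inside mutate. The following is a minimal sketch of the same idea using add_count(), which appends the per-family row count without leaving the result grouped. It assumes dplyr is installed; the object names ped and result are illustrative, and the example rows are rebuilt by hand.

```r
library(dplyr)

# Example rows from the question, rebuilt by hand (ID columns are character).
ped <- data.frame(
  FamilyID = c("F1961", "F1961", "F1961", "0450", "0006", "0189", "0189"),
  SampleID = c("F1961-1_8005116592", "F1961-2_8005116603", "F1961-3_8005116421",
               "F350_8005441283", "F355_8005441353",
               "F359_8005441284", "F359_8005441285"),
  MotherID = c("F1961-3_8005116421", "0", "0", "0", "0", "0", "0"),
  FatherID = c("F1961-2_8005116603", "0", "0", "0", "0", "0", "0"),
  Sex = c(1L, 2L, 1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

result <- ped %>%
  # add_count() adds the number of rows per FamilyID as a new column
  add_count(FamilyID, name = "FamilySize") %>%
  mutate(
    # only fill in parents that are missing ("0") in families of exactly two
    MotherID = if_else(FamilySize == 2 & MotherID == "0",
                       paste0(FamilyID, "_mother"), MotherID),
    FatherID = if_else(FamilySize == 2 & FatherID == "0",
                       paste0(FamilyID, "_father"), FatherID)
  )

print(result)
```

Because the result is not grouped, no ungroup() is needed before further processing.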

Answer 2

You can group by FamilyID with dplyr and update the MotherID / FatherID columns for groups where n() == 2.

Option #1: get the result in the OP's expected format

library(dplyr)
# The test must yield one TRUE/FALSE per row. With a scalar test such as
# n() == 2 alone, ifelse() returns a single value that is recycled over the group.
df %>% group_by(FamilyID) %>%
  mutate(MotherID = ifelse(n() == 2 & MotherID == "0", paste(FamilyID, "mother", sep = "_"), MotherID)) %>%
  mutate(FatherID = ifelse(n() == 2 & FatherID == "0", paste(FamilyID, "father", sep = "_"), FatherID))

# FamilyID SampleID           MotherID           FatherID             Sex
# <chr>    <chr>              <chr>              <chr>              <int>
# 1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1
# 2 F1961    F1961-2_8005116603 0                  0                      2
# 3 F1961    F1961-3_8005116421 0                  0                      1
# 4 0450     F350_8005441283    0                  0                      1
# 5 0006     F355_8005441353    0                  0                      1
# 6 0189     F359_8005441284    0189_mother        0189_father            1
# 7 0189     F359_8005441285    0189_mother        0189_father            2
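One detail worth keeping in mind when calling ifelse() inside a grouped mutate: the result has the length of the test, not of the yes/no arguments. The small base R sketch below illustrates this with the MotherID values of a three-member family; the variable names are made up for the illustration.

```r
# MotherID values of a three-member family, as in the F1961 rows
mother <- c("F1961-3_8005116421", "0", "0")
n_rows <- length(mother)

# Scalar test: the result has length 1, so only the first element of
# `mother` comes back. Inside a grouped mutate() that single value would
# then be recycled over every row of the group.
scalar_result <- ifelse(n_rows == 2, "F1961_mother", mother)

# Vector test: one TRUE/FALSE per row, so every row keeps its own value.
vector_result <- ifelse(n_rows == 2 & mother == "0", "F1961_mother", mother)

print(scalar_result)
print(vector_result)
```

Combining the group-level condition with a row-level one, as in the second call, is one way to keep the test as long as the column.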

Option #2: if the OP is happy with dummy IDs of the form FamilyID_dummy, mutate_at offers a more elegant solution:

library(dplyr)

# Note: funs() is deprecated since dplyr 0.8.0.
# The `. == "0"` part makes the test one value per row, so ifelse() does not
# collapse the group to a single recycled value.
df %>% group_by(FamilyID) %>%
  mutate_at(vars(c("MotherID","FatherID")), 
              funs(ifelse(n() == 2 & . == "0", paste(FamilyID, "dummy", sep= "_"), .)))

# # A tibble: 7 x 5
# # Groups: FamilyID [4]
# FamilyID SampleID           MotherID           FatherID             Sex
# <chr>    <chr>              <chr>              <chr>              <int>
# 1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1
# 2 F1961    F1961-2_8005116603 0                  0                      2
# 3 F1961    F1961-3_8005116421 0                  0                      1
# 4 0450     F350_8005441283    0                  0                      1
# 5 0006     F355_8005441353    0                  0                      1
# 6 0189     F359_8005441284    0189_dummy         0189_dummy             1
# 7 0189     F359_8005441285    0189_dummy         0189_dummy             2
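With dplyr 1.0.0 or later, the same column-wise update can be written with across() in place of mutate_at() and funs(). The following is a sketch under that assumption rather than part of the original answer. It rebuilds the example data with explicit colClasses so that IDs such as 0450 stay character; the object name out is illustrative.

```r
library(dplyr)

# Example data, one record per line; colClasses keeps the IDs as character.
df <- read.table(text = "FamilyID SampleID MotherID FatherID Sex
F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
F1961 F1961-2_8005116603 0 0 2
F1961 F1961-3_8005116421 0 0 1
0450 F350_8005441283 0 0 1
0006 F355_8005441353 0 0 1
0189 F359_8005441284 0 0 1
0189 F359_8005441285 0 0 2",
  header = TRUE, stringsAsFactors = FALSE,
  colClasses = c("character", "character", "character", "character", "integer"))

out <- df %>%
  group_by(FamilyID) %>%
  # apply the same replacement to both parent columns;
  # .x is the current column, n() the size of the current group
  mutate(across(c(MotherID, FatherID),
                ~ ifelse(n() == 2 & .x == "0",
                         paste(FamilyID, "dummy", sep = "_"),
                         .x))) %>%
  ungroup()

print(out)
```

The trailing ungroup() is optional; it only prevents later verbs from operating per family by accident.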

Data:

df <- read.table(text = "FamilyID SampleID MotherID FatherID Sex
F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
F1961 F1961-2_8005116603 0 0 2
F1961 F1961-3_8005116421 0 0 1
0450 F350_8005441283 0 0 1
0006 F355_8005441353 0 0 1
0189 F359_8005441284 0 0 1
0189 F359_8005441285 0 0 2", header = TRUE, stringsAsFactors = FALSE)

Identify rows corresponding to values that occur twice in df$columnA, then assign a value in df$columnB
