I have a data frame, pedigree, that looks like this:
FamilyID SampleID MotherID FatherID Sex
F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
F1961 F1961-2_8005116603 0 0 2
F1961 F1961-3_8005116421 0 0 1
0450 F350_8005441283 0 0 1
0006 F355_8005441353 0 0 1
0189 F359_8005441284 0 0 1
0189 F359_8005441285 0 0 2
.
.
.
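Some FamilyIDs (e.g. 0189) appear twice; these correspond to sibling pairs whose parents were not recorded.
I need to add a dummy fatherID / motherID that is shared within each sibling pair, for downstream analysis.
Specifically, I want to identify the samples whose FamilyID appears twice and assign them a shared motherID / fatherID value, so that the example above would look like this:
Expected output: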
FamilyID SampleID MotherID FatherID Sex
F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
F1961 F1961-2_8005116603 0 0 2
F1961 F1961-3_8005116421 0 0 1
0450 F350_8005441283 0 0 1
0006 F355_8005441353 0 0 1
0189 F359_8005441284 0189_mother 0189_father 1
0189 F359_8005441285 0189_mother 0189_father 2
.
.
.
So far I have tried starting with mutate to add a column indicating how many times each FamilyID is observed, but this does not work:
pedigree %>%
mutate(FamilySize = count(Family_ID))
Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'groups' applied to an object of class "character".
Many thanks for your help.
Answer 0 (score: 1)
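To get the family size, group by FamilyID and use n() to count the rows in each group. We can then use mutate together with if_else to replace the values of MotherID and FatherID where needed. The table can stay grouped by FamilyID for this step: if_else is vectorised, so within each group the condition is evaluated element by element, and FamilySize is constant across the group. There is no need to switch to rowwise, even if the condition also depended on a per-row variable such as Sex; rowwise only becomes necessary when the function being applied is not vectorised.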
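library(dplyr)

pedigree %>%
  group_by(FamilyID) %>%
  # n() is the number of rows in the current FamilyID group
  mutate(FamilySize = n()) %>%
  # fill in dummy parents only for two-member families with no recorded parent
  mutate(MotherID = if_else(FamilySize == 2 & MotherID == "0",
                            paste0(FamilyID, '_mother'),
                            MotherID),
         FatherID = if_else(FamilySize == 2 & FatherID == "0",
                            paste0(FamilyID, '_father'),
                            FatherID))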
# A tibble: 7 x 6
FamilyID SampleID MotherID FatherID Sex FamilySize
<chr> <chr> <chr> <chr> <int> <int>
1 F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1 3
2 F1961 F1961-2_8005116603 0 0 2 3
3 F1961 F1961-3_8005116421 0 0 1 3
4 0450 F350_8005441283 0 0 1 1
5 0006 F355_8005441353 0 0 1 1
6 0189 F359_8005441284 0189_mother 0189_father 1 2
7 0189 F359_8005441285 0189_mother 0189_father 2 2
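As a possible shortcut, dplyr's add_count() appends the per-group row count without an explicit group_by(). The sketch below is a variation on the answer above rather than part of it. It rebuilds the example data inline, assumes the ID columns are read as character (forced here with colClasses), and drops the helper column at the end; the name result is only illustrative.

```r
library(dplyr)

# Example data copied from the question; colClasses keeps IDs such as "0450" as text.
pedigree <- read.table(text =
"FamilyID SampleID MotherID FatherID Sex
F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
F1961 F1961-2_8005116603 0 0 2
F1961 F1961-3_8005116421 0 0 1
0450 F350_8005441283 0 0 1
0006 F355_8005441353 0 0 1
0189 F359_8005441284 0 0 1
0189 F359_8005441285 0 0 2",
header = TRUE, colClasses = c(rep("character", 4), "integer"))

result <- pedigree %>%
  # add_count() adds the number of rows per FamilyID and keeps the row order
  add_count(FamilyID, name = "FamilySize") %>%
  mutate(
    MotherID = if_else(FamilySize == 2 & MotherID == "0",
                       paste0(FamilyID, "_mother"), MotherID),
    FatherID = if_else(FamilySize == 2 & FatherID == "0",
                       paste0(FamilyID, "_father"), FatherID)
  ) %>%
  # remove the helper column so the columns match the expected output
  select(-FamilySize)

print(result)
```

Because add_count() does not leave the data grouped, no ungroup() is needed before further processing.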
Answer 1 (score: 1)
You can use dplyr to group by FamilyID and update the columns (MotherID / FatherID) for groups where n() == 2.
Option #1: get the result in the OP's expected format
library(dplyr)

df %>%
  group_by(FamilyID) %>%
  # n() == 2 is a single value per group, so use if/else rather than ifelse():
  # ifelse() would return one element and recycle it across the whole group
  mutate(MotherID = if (n() == 2) paste(FamilyID, "mother", sep = "_") else MotherID,
         FatherID = if (n() == 2) paste(FamilyID, "father", sep = "_") else FatherID)

# FamilyID SampleID MotherID FatherID Sex
# <chr> <chr> <chr> <chr> <int>
# 1 F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
# 2 F1961 F1961-2_8005116603 0 0 2
# 3 F1961 F1961-3_8005116421 0 0 1
# 4 0450 F350_8005441283 0 0 1
# 5 0006 F355_8005441353 0 0 1
# 6 0189 F359_8005441284 0189_mother 0189_father 1
# 7 0189 F359_8005441285 0189_mother 0189_father 2
Option #2: if the OP is happy with dummy IDs of the form FamilyID_dummy, mutate_at gives a more compact solution:
library(dplyr)

df %>%
  group_by(FamilyID) %>%
  # list(~ ...) replaces funs(), which has been deprecated since dplyr 0.8.0;
  # if/else keeps the original column when the group does not have two rows
  mutate_at(vars(MotherID, FatherID),
            list(~ if (n() == 2) paste(FamilyID, "dummy", sep = "_") else .))

# # A tibble: 7 x 5
# # Groups: FamilyID [4]
# FamilyID SampleID MotherID FatherID Sex
# <chr> <chr> <chr> <chr> <int>
# 1 F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
# 2 F1961 F1961-2_8005116603 0 0 2
# 3 F1961 F1961-3_8005116421 0 0 1
# 4 0450 F350_8005441283 0 0 1
# 5 0006 F355_8005441353 0 0 1
# 6 0189 F359_8005441284 0189_dummy 0189_dummy 1
# 7 0189 F359_8005441285 0189_dummy 0189_dummy 2
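A note on scalar conditions such as n() == 2: base R's ifelse() returns a result shaped like its test, so a length-one test yields a length-one result, which a grouped mutate() then recycles across every row of the group. The sketch below illustrates this outside dplyr. The names ids, n_rows, scalar_result and vector_result are hypothetical, and ids stands in for the MotherID values of the three-member family F1961.

```r
# MotherID values of a three-member family (illustrative stand-in)
ids <- c("F1961-3_8005116421", "0", "0")
n_rows <- length(ids)

# Length-one test: ifelse() returns a single value, the first element of `ids`
scalar_result <- ifelse(n_rows == 2, "dummy", ids)

# if/else returns the whole vector untouched when the condition is FALSE
vector_result <- if (n_rows == 2) rep("dummy", n_rows) else ids

print(scalar_result)
print(vector_result)
```

This is why if/else, or a condition that is itself a vector of the group's length, is the safer choice for group-level tests inside mutate().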
Data:
df <- read.table(text =
"FamilyID SampleID MotherID FatherID Sex
F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
F1961 F1961-2_8005116603 0 0 2
F1961 F1961-3_8005116421 0 0 1
0450 F350_8005441283 0 0 1
0006 F355_8005441353 0 0 1
0189 F359_8005441284 0 0 1
0189 F359_8005441285 0 0 2",
header = TRUE, stringsAsFactors = FALSE)
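Before modifying anything, it can help to check which families would be affected. The base R sketch below does this on a reduced copy of the example data (only the FamilyID and SampleID columns); the names family_sizes, pair_families and pair_samples are hypothetical.

```r
# Reduced copy of the example data: only the columns needed for the check
df <- data.frame(
  FamilyID = c("F1961", "F1961", "F1961", "0450", "0006", "0189", "0189"),
  SampleID = c("F1961-1_8005116592", "F1961-2_8005116603", "F1961-3_8005116421",
               "F350_8005441283", "F355_8005441353",
               "F359_8005441284", "F359_8005441285"),
  stringsAsFactors = FALSE
)

# Count how many rows each FamilyID has
family_sizes <- table(df$FamilyID)

# FamilyIDs that occur exactly twice, and the samples that belong to them
pair_families <- names(family_sizes[family_sizes == 2])
pair_samples <- df$SampleID[df$FamilyID %in% pair_families]

print(pair_families)
print(pair_samples)
```

Note that a count of two only suggests a sibling pair; a family of two with a recorded parent-child link would also match, so the parent columns may be worth checking as well.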