我有一个包含三列的data.frame,第一列(诊断)是诊断HCM或DCM。第二列(结果)是以下任一种的遗传结果:致病性(P),可能致病性(LP),未知显着性变异(VUS)。在第三栏(Gene.Name)是他所涉及的基因的名称,即MYH7,RAF1,TTN等......
我遇到的问题是当一行涉及多个基因时,例如在结果栏下的第69行,它有LP,VUS'在Gene.Name列下,它有' RAF1(LP),TTN(VUS)'。
如何让R在第69行下方创建一个新行,以及在','之前的内容。在两个栏目中(' LP'' RAF1')留在一行并移动'(' VUS'和' TTN)到新行?同时保持诊断列不变
我希望能够对所有68行执行此操作,并记住有一些列,其中Result列中的结果适用于Gene.Name列中的多个基因,即“结果”列中的第6行' VUS',以及Gene.Name列' DSC2,LAMAS,TTN'所有这些都是VUS。
最终目标是制作显示每个基因频率的直方图,分为HCM和DCM以及致病性
我不确定R能否做到这一点,但我不知道,所以我想我会问。
提前致谢
这是我的数据的输入(它包含68个观察值,但它是来自另一个数据帧的子集,因此当我通过R放置时,我的示例中的第69行可能是第23行:
structure(list(Diagnosis = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 1L, 1L,
3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L), .Label = c("DCM",
"DCM ", "HCM"), class = "factor"), Result = structure(c(10L,
10L, 10L, 10L, 2L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 2L,
10L, 10L, 10L, 6L, 2L, 2L, 6L, 10L, 9L, 4L, 6L, 10L, 2L, 6L,
6L, 6L, 10L, 10L, 9L, 6L, 10L, 2L, 7L, 3L, 10L, 2L, 6L, 2L, 2L,
10L, 10L, 2L, 2L, 2L, 6L, 2L, 10L, 10L, 2L, 10L, 2L, 10L, 10L,
2L, 2L, 10L, 8L, 10L, 9L, 8L, 10L, 6L, 6L, 6L), .Label = c("",
"LP", "LP, VUS", "LP,VUS", "NIL", "P", "P, LB", "P, VUS", "P,VUS",
"VUS"), class = "factor"), Gene.Name = structure(c(41L, 10L,
41L, 12L, 39L, 41L, 38L, 4L, 42L, 27L, 17L, 37L, 24L, 18L, 9L,
3L, 26L, 14L, 25L, 34L, 15L, 13L, 7L, 35L, 32L, 30L, 19L, 32L,
21L, 29L, 31L, 6L, 20L, 19L, 29L, 19L, 33L, 22L, 41L, 21L, 29L,
19L, 19L, 19L, 16L, 16L, 16L, 40L, 1L, 21L, 2L, 8L, 41L, 41L,
39L, 41L, 36L, 38L, 21L, 11L, 23L, 39L, 5L, 23L, 28L, 19L, 21L,
21L), .Label = c("", "ACTA1", "ACTA1,TTN", "ACTN2,LAMA2,LDB3,NPPA,TTN",
"ALMS1/TXNRD2", "ANK2(P),BAG3, FKTN,MIB1, RYR2, SLC25A4, TTN,TTNn3",
"DES(P),DSC2(VUS),TTN(VUS)", "DES, TTN", "DMD,MYH6,PKP2,TTN",
"DSC2,LAMAS,TTN", "DSP", "EMD,TTN", "HRAS", "KRAS", "LMAP2",
"LMNA", "LMNA,ANKRD1,TTN,ACTN2", "MAP3K7", "MYBPC3", "MYEBC3(P),PRKAG2(VUS)",
"MYH7", "MYH7(LP),LAMP2(VUS), MYBPC3(VUS),TTN(VUS)", "MYH7(P),VCL(VUS)",
"MYH7, BAG3, NEXN", "MYH7, TTN", "MYL2", "PKP2, TTN", "PKP2,TTN",
"PRKAG2", "PRKAG2, TTN", "PRKAG2,MYH7", "PTPN11", "PTPN11(P), MYL3(LB)",
"RAF1", "RAF1(LP),TTN(VUS)", "RBM20", "TBX20,TTN", "TNNI3", "TNNT2",
"TPM1", "TTN", "TTN, DSG2"), class = "factor")), .Names = c("Diagnosis",
"Result", "Gene.Name"), row.names = c(1L, 6L, 7L, 8L, 11L, 12L,
14L, 15L, 17L, 18L, 22L, 25L, 29L, 30L, 32L, 33L, 47L, 57L, 59L,
63L, 64L, 67L, 68L, 69L, 70L, 75L, 79L, 80L, 81L, 82L, 84L, 86L,
88L, 89L, 92L, 93L, 95L, 98L, 100L, 101L, 106L, 107L, 109L, 110L,
111L, 112L, 113L, 114L, 115L, 116L, 119L, 127L, 130L, 132L, 133L,
134L, 137L, 138L, 139L, 140L, 141L, 142L, 145L, 148L, 150L, 151L,
152L, 153L), class = "data.frame")
答案 0 :(得分:4)
您可以尝试tidyverse
解决方案:
library(tidyverse)
d %>%
as.tibble(rownames = NULL) %>%
slice(c(1,8,23,32))
# A tibble: 4 x 3
Diagnosis Result Gene.Name
<fct> <fct> <fct>
1 DCM VUS TTN
2 DCM VUS ACTN2,LAMA2,LDB3,NPPA,TTN
3 HCM P,VUS DES(P),DSC2(VUS),TTN(VUS)
4 HCM VUS ANK2(P),BAG3, FKTN,MIB1, RYR2, SLC25A4, TTN,TTNn3
为了便于说明,我将焦点集中在一些示例行c(1,8,23,32)
上。
然后我使用了tidyr
的{{1}}和separate_rows
函数的组合。最后,我使用separate
清理了数据。
mutate
和使用完整数据的情节。
d %>%
as.tibble(rownames = NULL) %>%
slice(c(1,8,23,32)) %>%
separate_rows(Gene.Name, sep=",") %>%
separate(Gene.Name, into = c("Gene", "Res"), sep="[(]") %>%
mutate(Gene=str_trim(Gene),
Res=str_trim(gsub("[)]","",Res))) %>%
mutate(Res=ifelse(is.na(Res), as.character(Result), Res))
# A tibble: 17 x 4
Diagnosis Result Gene Res
<fct> <fct> <chr> <chr>
1 DCM VUS TTN VUS
2 DCM VUS ACTN2 VUS
3 DCM VUS LAMA2 VUS
4 DCM VUS LDB3 VUS
5 DCM VUS NPPA VUS
6 DCM VUS TTN VUS
7 HCM P,VUS DES P
8 HCM P,VUS DSC2 VUS
9 HCM P,VUS TTN VUS
10 HCM VUS ANK2 P
11 HCM VUS BAG3 VUS
12 HCM VUS FKTN VUS
13 HCM VUS MIB1 VUS
14 HCM VUS RYR2 VUS
15 HCM VUS SLC25A4 VUS
16 HCM VUS TTN VUS
17 HCM VUS TTNn3 VUS