吐出data.frame中的数据除以逗号分隔到新行

时间:2018-04-10 08:46:20

标签: r

我有一个包含三列的data.frame,第一列(诊断)是诊断HCM或DCM。第二列(结果)是以下任一种的遗传结果:致病性(P),可能致病性(LP),未知显着性变异(VUS)。在第三栏(Gene.Name)是他所涉及的基因的名称,即MYH7,RAF1,TTN等......

我遇到的问题是当一行涉及多个基因时,例如在结果栏下的第69行,它有LP,VUS'在Gene.Name列下,它有' RAF1(LP),TTN(VUS)'。

如何让R在第69行下方创建一个新行,以及在','之前的内容。在两个栏目中(' LP'' RAF1')留在一行并移动'(' VUS'和' TTN)到新行?同时保持诊断列不变

我希望能够对所有68行执行此操作,并记住有一些列,其中Result列中的结果适用于Gene.Name列中的多个基因,即“结果”列中的第6行' VUS',以及Gene.Name列' DSC2,LAMAS,TTN'所有这些都是VUS。

最终目标是制作显示每个基因频率的直方图,分为HCM和DCM以及致病性

我不确定R能否做到这一点,但我不知道,所以我想我会问。

提前致谢

这是我的数据的输入(它包含68个观察值,但它是来自另一个数据帧的子集,因此当我通过R放置时,我的示例中的第69行可能是第23行:

structure(list(Diagnosis = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 
3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L), .Label = c("DCM", 
"DCM ", "HCM"), class = "factor"), Result = structure(c(10L, 
10L, 10L, 10L, 2L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 2L, 
10L, 10L, 10L, 6L, 2L, 2L, 6L, 10L, 9L, 4L, 6L, 10L, 2L, 6L, 
6L, 6L, 10L, 10L, 9L, 6L, 10L, 2L, 7L, 3L, 10L, 2L, 6L, 2L, 2L, 
10L, 10L, 2L, 2L, 2L, 6L, 2L, 10L, 10L, 2L, 10L, 2L, 10L, 10L, 
2L, 2L, 10L, 8L, 10L, 9L, 8L, 10L, 6L, 6L, 6L), .Label = c("", 
"LP", "LP, VUS", "LP,VUS", "NIL", "P", "P, LB", "P, VUS", "P,VUS", 
"VUS"), class = "factor"), Gene.Name = structure(c(41L, 10L, 
41L, 12L, 39L, 41L, 38L, 4L, 42L, 27L, 17L, 37L, 24L, 18L, 9L, 
3L, 26L, 14L, 25L, 34L, 15L, 13L, 7L, 35L, 32L, 30L, 19L, 32L, 
21L, 29L, 31L, 6L, 20L, 19L, 29L, 19L, 33L, 22L, 41L, 21L, 29L, 
19L, 19L, 19L, 16L, 16L, 16L, 40L, 1L, 21L, 2L, 8L, 41L, 41L, 
39L, 41L, 36L, 38L, 21L, 11L, 23L, 39L, 5L, 23L, 28L, 19L, 21L, 
21L), .Label = c("", "ACTA1", "ACTA1,TTN", "ACTN2,LAMA2,LDB3,NPPA,TTN", 
"ALMS1/TXNRD2", "ANK2(P),BAG3, FKTN,MIB1, RYR2, SLC25A4, TTN,TTNn3", 
"DES(P),DSC2(VUS),TTN(VUS)", "DES, TTN", "DMD,MYH6,PKP2,TTN", 
"DSC2,LAMAS,TTN", "DSP", "EMD,TTN", "HRAS", "KRAS", "LMAP2", 
"LMNA", "LMNA,ANKRD1,TTN,ACTN2", "MAP3K7", "MYBPC3", "MYEBC3(P),PRKAG2(VUS)", 
"MYH7", "MYH7(LP),LAMP2(VUS), MYBPC3(VUS),TTN(VUS)", "MYH7(P),VCL(VUS)", 
"MYH7, BAG3, NEXN", "MYH7, TTN", "MYL2", "PKP2, TTN", "PKP2,TTN", 
"PRKAG2", "PRKAG2, TTN", "PRKAG2,MYH7", "PTPN11", "PTPN11(P), MYL3(LB)", 
"RAF1", "RAF1(LP),TTN(VUS)", "RBM20", "TBX20,TTN", "TNNI3", "TNNT2", 
"TPM1", "TTN", "TTN, DSG2"), class = "factor")), .Names = c("Diagnosis", 
"Result", "Gene.Name"), row.names = c(1L, 6L, 7L, 8L, 11L, 12L, 
14L, 15L, 17L, 18L, 22L, 25L, 29L, 30L, 32L, 33L, 47L, 57L, 59L, 
63L, 64L, 67L, 68L, 69L, 70L, 75L, 79L, 80L, 81L, 82L, 84L, 86L, 
88L, 89L, 92L, 93L, 95L, 98L, 100L, 101L, 106L, 107L, 109L, 110L, 
111L, 112L, 113L, 114L, 115L, 116L, 119L, 127L, 130L, 132L, 133L, 
134L, 137L, 138L, 139L, 140L, 141L, 142L, 145L, 148L, 150L, 151L, 
152L, 153L), class = "data.frame")

1 个答案:

答案 0 :(得分:4)

您可以尝试tidyverse解决方案:

library(tidyverse)
d %>% 
  as.tibble(rownames = NULL) %>% 
  slice(c(1,8,23,32)) 
# A tibble: 4 x 3
  Diagnosis Result Gene.Name                                        
  <fct>     <fct>  <fct>                                            
1 DCM       VUS    TTN                                              
2 DCM       VUS    ACTN2,LAMA2,LDB3,NPPA,TTN                        
3 HCM       P,VUS  DES(P),DSC2(VUS),TTN(VUS)                        
4 HCM       VUS    ANK2(P),BAG3, FKTN,MIB1, RYR2, SLC25A4, TTN,TTNn3

为了便于说明,我将焦点集中在一些示例行c(1,8,23,32)上。 然后我使用了tidyr的{​​{1}}和separate_rows函数的组合。最后,我使用separate清理了数据。

mutate

和使用完整数据的情节。

d %>% 
  as.tibble(rownames = NULL) %>% 
  slice(c(1,8,23,32)) %>% 
  separate_rows(Gene.Name, sep=",") %>% 
  separate(Gene.Name, into = c("Gene", "Res"), sep="[(]") %>% 
  mutate(Gene=str_trim(Gene),
         Res=str_trim(gsub("[)]","",Res))) %>% 
  mutate(Res=ifelse(is.na(Res), as.character(Result), Res))
# A tibble: 17 x 4
   Diagnosis Result Gene    Res  
   <fct>     <fct>  <chr>   <chr>
 1 DCM       VUS    TTN     VUS  
 2 DCM       VUS    ACTN2   VUS  
 3 DCM       VUS    LAMA2   VUS  
 4 DCM       VUS    LDB3    VUS  
 5 DCM       VUS    NPPA    VUS  
 6 DCM       VUS    TTN     VUS  
 7 HCM       P,VUS  DES     P    
 8 HCM       P,VUS  DSC2    VUS  
 9 HCM       P,VUS  TTN     VUS  
10 HCM       VUS    ANK2    P    
11 HCM       VUS    BAG3    VUS  
12 HCM       VUS    FKTN    VUS  
13 HCM       VUS    MIB1    VUS  
14 HCM       VUS    RYR2    VUS  
15 HCM       VUS    SLC25A4 VUS  
16 HCM       VUS    TTN     VUS  
17 HCM       VUS    TTNn3   VUS 

enter image description here