Summarizing 3 sparse matrices into one full matrix

Time: 2019-04-21 00:22:27

Tags: r matrix sparse-matrix

Suppose I have three tables: patients, samples, and mutations.

  • The patients table has unique rows, each with a unique patient_id.

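  • The samples table has unique rows, each with a unique sample_id, plus a patient_id that can also be found in the patients table. Several rows in the samples table may share the same patient_id.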

  • The mutations table has non-unique rows. Each row in the mutations table contains only two columns: gene and sample_id.

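I need to create a new table, call it summary, with patient_id in the first column, then sample_id, followed by one column for each distinct gene in the mutations table.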

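Each row of the new summary table should contain:

  • the patient_id from the patients table
  • the sample_id from the samples table
  • in each subsequent gene column, the number 1 if that gene appears in the mutations table for that particular sample_id, or the number 0 if it does not.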

The new summary table would look something like this:

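patient_id, sample_id, gene A, gene B, gene C, gene D, etc
12345678,54321,1,0,0,0
23456789,65432,0,1,1,0
34567890,76543,0,0,1,0
34567890,87654,0,1,0,1
etc

The new summary table must have a 0 or 1 entry for each distinct gene found in the mutations table, even if the mutations table has no row with that gene for the patient in question.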

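Keep in mind that several samples may belong to the same patient, so the summary table may contain multiple rows for a given patient, each row representing a different sample.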
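One way the table described above could be sketched in base R is with table() on factors whose levels are fixed in advance. The toy data frames below are made up for illustration, and the column names gene, sample_id and patient_id follow the description rather than the real export (which uses Hugo_Symbol and Tumor_Sample_Barcode). Fixing the levels keeps samples with no mutations as all-zero rows, and comparing the counts with 0 turns duplicate mutation rows into a single 1.

```r
# Hypothetical toy data; column names follow the description above.
samples <- data.frame(
  sample_id  = c("S1", "S2", "S3"),
  patient_id = c("P1", "P1", "P2"),
  stringsAsFactors = FALSE
)
mutations <- data.frame(
  gene      = c("BAP1", "KRAS", "BAP1", "IDH1", "TP53"),
  sample_id = c("S1",   "S1",   "S1",   "S2",   "S9"),
  stringsAsFactors = FALSE
)

# One column per distinct gene in the mutations table.
genes <- sort(unique(mutations$gene))

# Factors with explicit levels keep samples without mutations (S3) as rows.
# Mutations whose sample_id is not in the samples table (S9) become NA and
# are dropped by table(), so TP53 ends up as an all-zero column.
counts <- table(
  factor(mutations$sample_id, levels = samples$sample_id),
  factor(mutations$gene, levels = genes)
)

# Turn counts into a 0/1 matrix; rows are in the same order as `samples`.
indicator <- matrix(as.integer(counts > 0), nrow = nrow(counts),
                    dimnames = list(NULL, genes))

summary_tbl <- data.frame(
  patient_id = samples$patient_id,
  sample_id  = samples$sample_id,
  indicator,
  check.names = FALSE,       # keep gene names as they are
  stringsAsFactors = FALSE
)
print(summary_tbl)
```

Patient P1 appears twice in the result, once per sample, which matches the requirement of one row per sample rather than one row per patient.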

Thanks for any guidance. I'm still fairly new to R... :)

Sample data:

Patients table:

PATIENT_ID,AGE,PARTC_CONSENTED_12_245,AGE_CURRENT,RACE,RELIGION,ETHNICITY,OS_STATUS,OS_MONTHS,PED_IND,SEX,RECURRENCE,POD_FIRST_LINE,SYSTEMIC_TREATMENT,TIME_TO_LAST_FOLLOWUP
P-0000114,57,NO,59,WHITE,CATHOLIC/ROMAN,Non-Spanish; Non-Hispanic,DECEASED,15.16,No,Female,0,Yes,gem/ox + HAI FUDR,15.16
P-0000127,62,NO,64,WHITE,NONE,Non-Spanish; Non-Hispanic,DECEASED,14.28,No,Male,0,Yes,gem/cis,14.28
P-0000147,40,NO,45,BLACK,CHRISTIAN,Non-Spanish; Non-Hispanic,LIVING,38.433,No,Female,0,Yes,gem,38.45
P-0000154,76,NO,79,WHITE,JEWISH,Non-Spanish; Non-Hispanic,DECEASED,23.145,No,Male,0,Yes,gem/cis,23.52
P-0000159,67,NO,70,"Other Asian, including Asian, NOS and Oriental, NOS",CHRISTIAN,Non-Spanish; Non-Hispanic,DECEASED,18.773,No,Female,0,Yes,gem/cis,18.78

Samples table:

SAMPLE_ID,PATIENT_ID,HAS_MATCHED_NORMAL,TIME_TO_METASTASIS_MONTHS,SAMPLE_TYPE,SAMPLE_CLASS,METASTATIC_SITE,PRIMARY_SITE,ONCOTREE_CODE,GENE_PANEL,SO_COMMENTS,SAMPLE_COVERAGE,TUMOR_PURITY,MSI_COMMENT,MSI_SCORE,MSI_TYPE,INSTITUTE,SOMATIC_STATUS,AGE_AT_SEQ_REPORT,ARCHER,CVR_TMB_COHORT_PERCENTILE,CVR_TMB_SCORE,CVR_TMB_TT_COHORT_PERCENTILE,STAGE_4_DX
P-0000114-T01-IM3,P-0000114,Matched,0,Metastasis,Tumor,Lymph Node,Liver,IHCH,IMPACT341,938,60,Not Available,0.47,Stable,MSKCC,Matched,58,NO,58.6,4.5,75.9,Yes
P-0000114-T02-IM3,P-0000114,Matched,0,Primary,Tumor,Not Applicable,Liver,IHCH,IMPACT341,409,60,Not Available,0.26,Stable,MSKCC,Matched,59,NO,58.6,4.5,75.9,Yes
P-0000127-T01-IM3,P-0000127,Matched,0,Metastasis,Tumor,Lymph Node,Liver,IHCH,IMPACT341,623,30,Not Available,0,Stable,MSKCC,Matched,64,NO,29.9,2.2,36,Yes
P-0000127-T02-IM3,P-0000127,Matched,0,Metastasis,Tumor,Lymph Node,Liver,IHCH,IMPACT341,255,0,Not Available,0,Stable,MSKCC,Matched,64,NO,29.9,2.2,36,Yes
P-0000147-T01-IM3,P-0000147,Matched,25,Primary,Tumor,Not Applicable,Liver,IHCH,IMPACT341,1051,80,MICROSATELLITE STABLE (MSS). See MSI note below.,0.17,Stable,MSKCC,Matched,41,NO,0,0,0,No
P-0000154-T01-IM3,P-0000154,Matched,0,Primary,Tumor,Not Applicable,Liver,IHCH,IMPACT341,767,70,Not Available,1.2,Stable,MSKCC,Matched,78,NO,44.1,3.3,59.4,Yes

Mutations table:

Hugo_Symbol,Tumor_Sample_Barcode
BAP1,P-0009513-T01-IM5
PDGFRA,P-0000114-T01-IM5
BAP1,P-0009513-T01-IM5
KRAS,P-0000114-T02-IM3
CDKN1B,P-0000192-T02-IM3
IDH1,P-0000327-T01-IM3
ARID1A,P-0000327-T01-IM3
DOT1L,P-0000327-T01-IM3
NOTCH4,P-0001539-T01-IM3
ABL1,P-0001539-T01-IM3
SUFU,P-0001539-T01-IM3
PBRM1,P-0000114-T01-IM3
IDH1,P-0002143-T01-IM3
KRAS,P-0002143-T01-IM3
ARID1A,P-0000114-T01-IM3
MLL3,P-0000127-T01-IM3
ERBB3,P-0000117-T01-IM3
ARID1A,P-0002211-T01-IM3
TP53,P-0003407-T01-IM5
ARID1A,P-0000127-T01-IM3
ERBB3,P-000012707-T01-IM5
STAG2,P-0003407-T01-IM5
KRAS,P-0003473-T01-IM5
PBRM1,P-0003590-T01-IM5
TET2,P-0003590-T01-IM5
IDH1,P-0003795-T01-IM5
TP53,P-0003795-T01-IM5
SPEN,P-0003795-T01-IM5

1 answer:

Answer 0: (score: 0)

This did the trick in MySQL:

select cs.patient_id, g.*
from clynical_sample cs
inner join (
  select * from (
    -- cp: every (gene, sample_id) combination
    with cp as (
      select *
      from (select gene from mutations group by gene) c
      cross join (select sample_id from clynical_sample group by sample_id) m
    )
    -- flag each combination with 1 if it occurs in mutations, else 0
    select cp.gene, cp.sample_id, ifnull(m.id,0) id
    from cp
    left join (select gene, sample_id, 1 id from mutations) m
      on m.gene=cp.gene and m.sample_id=cp.sample_id
  )
pivot ( max(id) for gene in ('BAP1','PDGFRA','KRAS','CDKN1B','IDH1','ARID1A','DOT1L','NOTCH4','ABL1','PBRM1','MLL3','TET2','SPEN','CCND2','DDR2','RICTOR','SMAD4','GLI1','RASA1','MAP2K1','CSF3R','HIST1H3D','DNMT3B','CEBPA','GATA2','ARID1B','BRCA2','EPHA7','CTNNB1','EPHA5','EP300','RAF1','NF1','EGFR','NBN','INHA','CARD11','ANKRD11','ERBB3','TERT','DNMT1','ATM','RIT1','PDCD1','SMARCA4','FOXP1','DICER1','TGFBR2','PTPRS','FANCC','APC','NCOA3','NTRK1','PTPRD','NSD1','GRIN2A','SMARCB1','PTCH1','KEAP1','KDR','IRS2','PIK3R3','SUFU','STAG2','MAP3K13','SOX9','SETD2','FAT1','ZFHX3','NRAS','MAP3K1','ERBB4','JAK3','NF2','PGR','KDM6A','RPTOR','TP53','CIC','MSH2','MAP2K4','AXIN2','PTEN','XPO1','ERCC4','AXL','RNF43','DNMT3A','ERG','NOTCH2','RFWD2','IGF1R','GATA1','SMAD3','TMPRSS2','MLL','BRAF','TET1','BCOR','YAP1','HLA-A','PLCG2','CBL','IRS1','PIK3CA','POLE','LATS2','MST1','H3F3B','IRF4','AR','B2M','NCOR1','FUBP1','NOTCH3','ATR','RPS6KB2','TSC2','PIK3CG','MDM2','ROS1','TCF3','TSC1','FGFR2','FBXW7','FOXA1','MEN1','CDKN2Ap16INK4A','EPHA3','PMS1','PAK1','E2F3','PIK3CD','PLK2','MPL','RHEB','RBM10','ASXL2','MSH6','RAD21','BRIP1','PTPRT','GNA11','CDKN1A','RAD50','BRD4','STK11','ARID2','RUNX1','MTOR','JAK1','TBX3','MALT1','RYBP','MLL2','PIK3CB','SMO','AXIN1','MAPK3','VHL','JUN','KDM5A','ARID5B','AMER1','PPM1D','ASXL1','MLH1','CASP8','BARD1','DAXX','CDH1','PALB2','AKT3','RECQL4','IGF2','MED12','FLT3','HIST3H3','MST1R','EIF4A2','CREBBP','STAT5B','PHOX2B','BRCA1','ERBB2','MITF','RB1','CD79A','TMEM127','MAPK1','CDKN2A','CDKN2Ap14ARF','CSF1R','FLT4','CENPA','RPS6KA4','SRC','ERCC3','NEGR1','RET','ACVR1','SYK','ICOSLG','FYN','SOX17','ETV6','NTRK3','HIST1H1C','IDH2','CHEK1','GNAS','PPP6C','EZH2','MYCL1','SDHA','MDC1','ARAF','RAC1','KDM5C','PARP1','NKX2-1','CXCR4','SMAD2','IL7R','TGFBR1','U2AF1','SF3B1','FGFR4','ERRFI1','SMARCD1','FGFR1','EPHB1','PDPK1','FLCN','RAD54L','MGA','PPP2R1A')) ) g on g.sample_id = cs.sample_id;
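
Since the question is tagged r, the three steps of the query above (cross join of genes and samples, left join against the observed mutations, pivot to one column per gene) can also be sketched in base R. This is only an illustration on made-up toy data, not something run against the real tables; the column names gene, sample_id and patient_id are assumptions taken from the question's description. Unlike the query, the sketch reads the gene list from the data instead of hard-coding it.

```r
# Hypothetical toy data; column names follow the question's description.
samples <- data.frame(
  sample_id  = c("S1", "S2", "S3"),
  patient_id = c("P1", "P1", "P2"),
  stringsAsFactors = FALSE
)
mutations <- data.frame(
  gene      = c("BAP1", "KRAS", "BAP1", "IDH1"),
  sample_id = c("S1",   "S1",   "S1",   "S2"),
  stringsAsFactors = FALSE
)

# Step 1: cross join every distinct gene with every sample_id.
grid <- expand.grid(
  gene      = sort(unique(mutations$gene)),
  sample_id = samples$sample_id,
  stringsAsFactors = FALSE
)

# Step 2: left join the observed (gene, sample_id) pairs, flagged with 1.
observed <- unique(mutations)        # drop duplicate mutation rows
observed$id <- 1L
long <- merge(grid, observed, by = c("gene", "sample_id"), all.x = TRUE)
long$id[is.na(long$id)] <- 0L        # same role as ifnull(m.id, 0)

# Step 3: pivot genes into columns; each cell is the 0/1 flag for that pair.
wide <- as.data.frame.matrix(xtabs(id ~ sample_id + gene, data = long))
wide$sample_id <- rownames(wide)

# Attach patient_id and put the two id columns first.
result <- merge(samples[, c("patient_id", "sample_id")], wide, by = "sample_id")
gene_cols <- setdiff(names(result), c("patient_id", "sample_id"))
result <- result[, c("patient_id", "sample_id", gene_cols)]
print(result)
```

The long intermediate table makes each step easy to inspect, at the cost of building one row per gene and sample combination before pivoting.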