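I'm using MySQL to manipulate some data for cancer research and machine learning. This looks like an ideal problem for a PIVOT statement, but I can't quite work it out, so any help is welcome. If there is a better tool for the job (R, for example), I'd be happy to use that instead.
Suppose I have three tables: patients, samples and mutations.
The patients table has unique rows, each with a unique patient_id.
The samples table has unique rows, each with a unique sample_id, but also a patient_id that can be found in the patients table. The samples table may contain several rows with the same patient_id.
The mutations table has non-unique rows. Each row in the mutations table contains only two columns: gene and sample_id.
I need to create a new table, call it the summary table, whose first two columns are patient_id and sample_id, followed by one column for each distinct gene in the mutations table.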
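Each row of the new summary table should contain a patient_id, a sample_id, and a 0 or 1 in each gene column.
The new summary table would look like this: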
patient_id  sample_id  gene A  gene B  gene C  gene D  etc.
12345678    54321      1       0       0       0
23456789    65432      0       1       1       0
34567890    76543      0       0       1       0
34567890    87654      0       1       0       1
etc.
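The new summary table must have a 0 or 1 entry for every distinct gene found in the mutations table, even when the mutations table has no row pairing that gene with a given patient's sample_id.
Keep in mind that several samples may belong to the same patient, so the summary table may contain multiple rows for a given patient, each row representing a different sample.
Here is my current SQL, which does not work: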
SELECT cs.patient_id, g.*
FROM samples cs
INNER JOIN (
SELECT *
FROM
(WITH cp AS
(SELECT * FROM
(SELECT gene FROM mutations GROUP BY gene) c
CROSS JOIN (SELECT sample_id FROM samples GROUP BY sample_id) m)
SELECT cp.gene, cp.sample_id, IFNULL(m.id,0) id
FROM cp
LEFT JOIN (SELECT gene, sample_id, 1 id FROM mutations) m on m.gene=cp.gene and m.sample_id=cp.sample_id)
PIVOT ( MAX(id) for gene in ('BAP1','PDGFRA','KRAS','CDKN1B','IDH1','ARID1A','DOT1L','NOTCH4','ABL1',
'PBRM1','MLL3','TET2','SPEN','CCND2','DDR2','RICTOR','SMAD4','GLI1','RASA1',
'MAP2K1','CSF3R','HIST1H3D','DNMT3B','CEBPA','GATA2','ARID1B','BRCA2','EPHA7',
'CTNNB1','EPHA5','EP300','RAF1','NF1','EGFR','NBN','INHA','CARD11','ANKRD11',
'ERBB3','TERT','DNMT1','ATM','RIT1','PDCD1','SMARCA4','FOXP1','DICER1','TGFBR2',
'PTPRS','FANCC','APC','NCOA3','NTRK1','PTPRD','NSD1','GRIN2A','SMARCB1','PTCH1',
'KEAP1','KDR','IRS2','PIK3R3','SUFU','STAG2','MAP3K13','SOX9','SETD2','FAT1',
'ZFHX3','NRAS','MAP3K1','ERBB4','JAK3','NF2','PGR','KDM6A','RPTOR','TP53','CIC',
'MSH2','MAP2K4','AXIN2','PTEN','XPO1','ERCC4','AXL','RNF43','DNMT3A','ERG','NOTCH2',
'RFWD2','IGF1R','GATA1','SMAD3','TMPRSS2','MLL','BRAF','TET1','BCOR','YAP1','HLA-A',
'PLCG2','CBL','IRS1','PIK3CA','POLE','LATS2','MST1','H3F3B','IRF4','AR','B2M','NCOR1',
'FUBP1','NOTCH3','ATR','RPS6KB2','TSC2','PIK3CG','MDM2','ROS1','TCF3','TSC1','FGFR2',
'FBXW7','FOXA1','MEN1','CDKN2Ap16INK4A','EPHA3','PMS1','PAK1','E2F3','PIK3CD','PLK2',
'MPL','RHEB','RBM10','ASXL2','MSH6','RAD21','BRIP1','PTPRT','GNA11','CDKN1A','RAD50',
'BRD4','STK11','ARID2','RUNX1','MTOR','JAK1','TBX3','MALT1','RYBP','MLL2','PIK3CB',
'SMO','AXIN1','MAPK3','VHL','JUN','KDM5A','ARID5B','AMER1','PPM1D','ASXL1','MLH1',
'CASP8','BARD1','DAXX','CDH1','PALB2','AKT3','RECQL4','IGF2','MED12','FLT3','HIST3H3',
'MST1R','EIF4A2','CREBBP','STAT5B','PHOX2B','BRCA1','ERBB2','MITF','RB1','CD79A',
'TMEM127','MAPK1','CDKN2A','CDKN2Ap14ARF','CSF1R','FLT4','CENPA','RPS6KA4','SRC',
'ERCC3','NEGR1','RET','ACVR1','SYK','ICOSLG','FYN','SOX17','ETV6','NTRK3','HIST1H1C',
'IDH2','CHEK1','GNAS','PPP6C','EZH2','MYCL1','SDHA','MDC1','ARAF','RAC1','KDM5C','PARP1',
'NKX2-1','CXCR4','SMAD2','IL7R','TGFBR1','U2AF1','SF3B1','FGFR4','ERRFI1','SMARCD1','FGFR1',
'EPHB1','PDPK1','FLCN','RAD54L','MGA','PPP2R1A'))
) g on g.sample_id = cs.sample_id;
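Answer 0 (score: 0)
You seem to have overcomplicated this query; it can be much simpler. Here is an example that produces the first three gene columns. For the remaining genes, copy one of the MAX(CASE ...) lines and change the gene name.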
SELECT s.patient_id,
s.sample_id,
MAX( CASE WHEN m.gene = 'BAP1' THEN 1 ELSE 0 END) AS BAP1,
MAX( CASE WHEN m.gene = 'PDGFRA' THEN 1 ELSE 0 END) AS PDGFRA,
MAX( CASE WHEN m.gene = 'KRAS' THEN 1 ELSE 0 END) AS KRAS
FROM samples s
LEFT JOIN mutations m ON s.sample_id = m.sample_id
GROUP BY s.patient_id,
s.sample_id;
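To see how the query above behaves, including for a sample that has no mutations at all, here is a minimal sketch that runs it on a few made-up rows. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the sample data is invented purely for illustration; only the table and column names come from the question.

```python
import sqlite3

# Toy data, invented for illustration; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, patient_id INTEGER);
    CREATE TABLE mutations (gene TEXT, sample_id INTEGER);
    INSERT INTO samples VALUES (54321, 12345678), (65432, 23456789),
                               (76543, 34567890), (87654, 34567890);
    -- mutations rows are not unique: ('KRAS', 65432) appears twice.
    -- Sample 76543 has no mutations.
    INSERT INTO mutations VALUES ('BAP1', 54321), ('PDGFRA', 65432),
                                 ('KRAS', 65432), ('KRAS', 65432),
                                 ('PDGFRA', 87654);
""")

# Same conditional aggregation as above; ORDER BY added only to make
# the output order deterministic.
PIVOT_SQL = """
    SELECT s.patient_id,
           s.sample_id,
           MAX(CASE WHEN m.gene = 'BAP1'   THEN 1 ELSE 0 END) AS BAP1,
           MAX(CASE WHEN m.gene = 'PDGFRA' THEN 1 ELSE 0 END) AS PDGFRA,
           MAX(CASE WHEN m.gene = 'KRAS'   THEN 1 ELSE 0 END) AS KRAS
    FROM samples s
    LEFT JOIN mutations m ON s.sample_id = m.sample_id
    GROUP BY s.patient_id, s.sample_id
    ORDER BY s.sample_id
"""

rows = conn.execute(PIVOT_SQL).fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN keeps sample 76543 in the result even though it has no rows in mutations, and the ELSE 0 branch turns the resulting NULL gene into zeros. MAX collapses duplicate mutation rows into a single 1.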
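If you would rather generate this query dynamically than write every column by hand, you can do something like the following. Note that this example uses SQL Server (T-SQL) syntax, so it would need adapting for MySQL.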
DECLARE @Columns NVARCHAR(MAX),
@SQL NVARCHAR(MAX);
SELECT @Columns = ( SELECT CHAR(10) + CHAR(9) + ',MAX( CASE WHEN m.gene = ' + QUOTENAME( gene, '''') + ' THEN 1 ELSE 0 END) AS ' + QUOTENAME(gene)
FROM mutations
GROUP BY gene
FOR XML PATH(''), TYPE).value('./text()[1]', 'nvarchar(max)')
SET @SQL = N'SELECT s.patient_id ' + NCHAR(10)
+ N' ,s.sample_id '
+ @Columns + NCHAR(10)
+ N'FROM samples s ' + NCHAR(10)
+ N'LEFT JOIN mutations m ON s.sample_id = m.sample_id ' + NCHAR(10)
+ N'GROUP BY s.patient_id, ' + NCHAR(10)
+ N' s.sample_id;' + NCHAR(10)
PRINT @SQL; -- For debugging purposes
EXECUTE sp_executesql @SQL --, @ParametersDefinition, @Param1, @Param2, ..., @ParamN
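Since you mentioned being open to other tools, the same idea can also be sketched from a scripting language: read the distinct genes, build one MAX(CASE ...) column per gene, then run the generated statement. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. The helper names `quote_identifier` and `build_pivot_query` and the toy rows are hypothetical, made up for this illustration.

```python
import sqlite3


def quote_identifier(name):
    """Quote a column alias the standard-SQL/SQLite way (double any embedded ")."""
    return '"' + name.replace('"', '""') + '"'


def build_pivot_query(conn):
    """Return (sql, params) with one 0/1 column per distinct gene in mutations."""
    genes = [row[0] for row in conn.execute(
        "SELECT DISTINCT gene FROM mutations ORDER BY gene")]
    # Gene values are bound as parameters (?); only the quoted alias is
    # spliced into the SQL text.
    columns = "".join(
        ",\n       MAX(CASE WHEN m.gene = ? THEN 1 ELSE 0 END) AS "
        + quote_identifier(gene)
        for gene in genes)
    sql = (
        "SELECT s.patient_id,\n       s.sample_id" + columns + "\n"
        "FROM samples s\n"
        "LEFT JOIN mutations m ON s.sample_id = m.sample_id\n"
        "GROUP BY s.patient_id, s.sample_id\n"
        "ORDER BY s.sample_id"
    )
    return sql, genes


# Toy data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, patient_id INTEGER);
    CREATE TABLE mutations (gene TEXT, sample_id INTEGER);
    INSERT INTO samples VALUES (54321, 12345678), (65432, 23456789),
                               (76543, 34567890), (87654, 34567890);
    INSERT INTO mutations VALUES ('KRAS', 54321), ('HLA-A', 65432),
                                 ('KRAS', 65432), ('HLA-A', 87654);
""")

sql, params = build_pivot_query(conn)
print(sql)  # for debugging, like PRINT in the T-SQL version
cursor = conn.execute(sql, params)
header = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(header)
for row in rows:
    print(row)
```

Quoting the aliases matters here because some gene names in your list, such as HLA-A and NKX2-1, contain a hyphen and are not valid bare identifiers. MySQL quotes identifiers with backticks by default rather than double quotes, so `quote_identifier` would need adjusting there.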