Question

Here is the tab-delimited file:

Chr Start   End Ref Alt Func.refGene    Gene.refGene    GeneDetail.refGene  ExonicFunc.refGene  AAChange.refGene    snp138  clinvar_20140929    SIFT_score  SIFT_pred   Polyphen2_HDIV_score    Polyphen2_HDIV_pred Polyphen2_HVAR_score    Polyphen2_HVAR_pred LRT_score   LRT_pred    MutationTaster_score    MutationTaster_pred MutationAssessor_score  MutationAssessor_pred   FATHMM_score    FATHMM_pred RadialSVM_score RadialSVM_pred  LR_score    LR_pred VEST3_score CADD_raw    CADD_phred  GERP++_RS   phyloP46way_placental   phyloP100way_vertebrate SiPhy_29way_logOdds
chr13   52523808    52523808    C   T   exonic  ATP7B       nonsynonymous SNV   ATP7B:NM_000053:exon12:c.2855G>A:p.R952K,ATP7B:NM_001243182:exon13:c.2522G>A:p.R841K    rs732774    CLINSIG=non-pathogenic|non-pathogenic;CLNDBN=Wilson's_disease|not_specified;CLNREVSTAT=single|single;CLNACC=RCV000029357.1|RCV000078044.1;CLNDSDB=GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT|.;CLNDSDBID=NBK1512:C0019202:277900:ORPHA905:88518009|.    0.99    T   0.04    B   0.03    B   0.000   N   0.000   P   -1.04   N   -3.73   D   -0.965  T   0.000   T   0.214   1.511   11.00   6.06    1.111   2.781   12.356
chr13   52523867    52523867    T   G   exonic  ATP7B       synonymous SNV  ATP7B:NM_000053:exon12:c.2796A>C:p.S932S,ATP7B:NM_001243182:exon13:c.2463A>C:p.S821S

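I have a bash script that takes an ABI file as input and uses ANNOVAR to annotate the variants. It produces a tab-delimited text file containing the annotated variants. Every time the script runs on a different ABI file, the number of columns in the tab-delimited file stays fixed, but the number of rows and the individual annotations for each resulting variant may differ.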

Attempt so far ->

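I tried writing a bash script that extracts the different fields [of the first variant] from the tab-delimited text file, saves each one as a text file, combines all the generated text files into one, and uses an AWK script to assign a separate variable to each field of the combined file. I used AWK to create the HTML page, printing those variables inside the corresponding HTML tags, and this works for files that follow the same pattern in the tab-delimited text file. But when a particular field is missing from an annotated result that follows a different pattern, the script prints a field other than the one the variable was meant for.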

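If the first variant contains a clinically significant mutation, there will be an annotation in the "clinvar" column, so it needs to be reported in a different section along with the other details.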

The order of the fields in the combined text file is not the same for every variant, so the report generated from it is incorrect.

Expected result ->

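Since the rows of the tab-delimited file are not uniformly filled in, is there a way to set multiple conditions for each row? For example: if a particular column [e.g. clinvar] has a value, print it between HTML tags; if it is absent, check another column [e.g. rsID] and, if that has a value, print it inside some other HTML tags; and so on for the other columns.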

Variant position: chr13:52523808C>T

Variant type: Nonsynonymous-SNV

rsID: rs732774

Amino acid change: p.R952K

Gene name: ATP7B

Disease: Wilson's disease

Result: Non-pathogenic

The format of the HTML page, and the values in it, should look like this:

<html>
<title></title><head>
<style type="text/css">
body {background-color:lightgray}
h1   {background-color:SlateGray}
</style>
</head><body bgcolor="LightGray">
<table border=1><th align=>Test Code</th><th align=>Gene Name</th><th align=>Condition tested</th><th align=>Result</th>
<tr><td width=750 align=></td><td width=750 align=>ATP7B(RefSeq ID: NM_000053)</td><td width=750 align=>Wilson's_disease</td><td width=750 align=>Non-pathogenic</td></tr>
<h1 align=>Test Details</h1>
<table border=1><th align=centre>Genomic Location of Mutation</th><th align=centre>Mutation Type</th><th align=centre>dbSNP Identifier</th><th align=centre>Amino Acid Change</th><th align=centre>OMIM Identifier</th>
<h1 align=>Significant Findings</h1>
<tr><td width=750 align=>chr13:52523808C>T</td><td width=750 align=>Nonsynonymous-SNV</td><td width=750 align=>rs732774</td><td width=750 align=>p.R952K</td><td width=750 align=>http://www.omim.org/entry/277900</td></tr>
<p> The identified variant is located in the <strong> exonic </strong> region of the <strong> chr13 </strong> chromosome and is a <strong> Nonsynonymous-SNV </strong> which causes an amino acid change from <strong> Arginine </strong> to <strong> Lysine </strong>. The mutation has also been reported in the dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/) with an accession number of <strong> rs732774 </strong>. </p>
</table></body>
</html>

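Similarly, when there is a novel variant, where the ExonicFunc.refGene column contains "nonsynonymous" and the snp138 column has no value, it should print the SIFT_score along with the other details between HTML tags. These are just some of the conditions I need, but it would be really helpful if someone could suggest an approach that handles all of them.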

Thank you for reading such a long question; any help with this problem would be greatly appreciated.

Answer 1

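The awk program shown here pairs every header name with the matching value of each data row and prints one HTML table row per field. I think you can modify it to fit your needs. Keep in mind that the tricky rules you describe (when one value is absent, show another instead) are better implemented yourself than requested from someone else.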

#
# processor.awk
#
# usage: awk -f processor.awk annotated.txt
#


BEGIN   {
        header = "";
        # printf format: the first %s is the field name, the second its value
        html_template = "<tr><td>%s</td><td>%s</td></tr>\n"
        }
        {
        if( $0 == "" )
            next;   # skip empty lines
        if( header == "" )
        {   # the first not empty line is the header
            header = $0;
            # put every element of the header into an array
            # and remember how many columns there are
            nfields = split( header, fields, "\t" );
            # for debug: print the fields found
            #for( elem = 1; elem <= nfields; elem++ )
            #   print "field" elem ": " fields[elem];
        } # if
        else
        {
            # normal lines
            # split the line into the elements
            split( $0, content, "\t" );
            # for every column named in the header....
            # (a short line simply yields empty values)
            for( elem = 1; elem <= nfields; elem++ )
            {
                # print the result; printf inserts the values literally,
                # so "&" or "\" in a value cannot disturb the template
                printf( html_template, fields[elem], content[elem] );
            } # for
        } # if
        }
END     {
        }
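
As a possible starting point for the "if this column is empty, check the next one" rules from the question, here is a minimal sketch. It is not part of the program above. It looks columns up by their header name instead of by position, so a missing value cannot shift the other fields. The function name variant_rows, the class names on the rows, and the choice to treat a lone "." as a missing value are all assumptions made for illustration; the column names are taken from the header line in the question.

```shell
# variant_rows is a hypothetical helper name; it is not part of ANNOVAR.
# It reads a tab-delimited annotation table and prints one HTML row per variant.
variant_rows() {
    awk -F '\t' '
    # Escape the characters that are special in HTML.
    function esc(s) {
        gsub(/&/, "\\&amp;", s); gsub(/</, "\\&lt;", s); gsub(/>/, "\\&gt;", s)
        return s
    }
    # Value of the named column in the current row; "" if the column is
    # unknown, the row is too short, or the cell is "." (assumed NA marker).
    function val(name,    v) {
        if (!(name in col)) return ""
        v = $(col[name])
        return (v == ".") ? "" : v
    }
    # Pull KEY=value out of a ";"-separated annotation such as the clinvar cell.
    function kv(s, key,    parts, n, i) {
        n = split(s, parts, ";")
        for (i = 1; i <= n; i++)
            if (index(parts[i], key "=") == 1)
                return substr(parts[i], length(key) + 2)
        return ""
    }
    # Print one table row: position, mutation type, and two rule-specific cells.
    function row(kind, a, b) {
        printf "<tr class=\"%s\"><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>\n", kind, esc(pos), esc(val("ExonicFunc.refGene")), esc(a), esc(b)
    }
    # The first non-empty line is the header: remember each column position.
    !seen && NF { for (i = 1; i <= NF; i++) col[$i] = i; seen = 1; next }
    !NF { next }    # skip blank lines
    {
        pos  = val("Chr") ":" val("Start") val("Ref") ">" val("Alt")
        clin = val("clinvar_20140929")
        if (clin != "")                       # rule 1: ClinVar annotation present
            row("clinvar", kv(clin, "CLNDBN"), kv(clin, "CLINSIG"))
        else if (val("snp138") != "")         # rule 2: known dbSNP identifier
            row("dbsnp", val("snp138"), "")
        else if (val("ExonicFunc.refGene") ~ /^nonsynonymous/)
            row("novel", "SIFT_score", val("SIFT_score"))   # rule 3: novel variant
        else                                  # fallback: position and type only
            row("other", "", "")
    }' "$@"
}
```

It could be called as `variant_rows annotated.txt > rows.html`, where annotated.txt stands for the ANNOVAR output file. The rules are tried in order, so adding another condition means adding another `else if` branch that calls `val()` with the relevant column name.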

Convert a tab-delimited text file into an HTML / PDF / LaTeX / knitr report
