Perl uninitialized value in hash lookup of gene symbols

Time: 2015-11-05 15:40:34

Tags: regex perl hash initialization

Update (2):

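I changed the code to discard the comments in the header, but I still get a syntax error on the hash key/value assignment:

syntax error at ./convertDataToGeneSymbol.pl line 99, near "$geneSymbolToGo{"
syntax error at ./convertDataToGeneSymbol.pl line 101, near "}"

I can't spot any error in the code, so I'm thinking the array is unable to read the value of $go?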

This is the header of input file 3:

! 10-20 lines of comments

UniProtKB /t A0A021WW37 /t CG17167 /t GO:0016021 /t GO_REF:0000038
(Still learning how to format on this site; /t means tab-separated.)

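P.S. Sorry about all the comments. My professor requires extensive commenting in our programs. use strict has been giving me some problems with this program (mostly due to my inexperience), but when I removed it I got the results I wanted. Thanks for all the help so far!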

#!/usr/bin/perl
use warnings;
use diagnostics;

# Title: convertDataToGeneSymbol.pl
# Author: Nicholas Bense
# Date: 11/4/15

# Open a filehandle to read file #1
open(INF1,"<",'/scratch/Drosophila/fb_synonym_fb_2014_05.tsv' ) or die $!;

# Open a filehandle to read file #2
open(INF2,"<",'/scratch/Drosophila/FlyRNAi_data_baseline_vs_EGF.txt') or die $!;

# Open a filehandle to read file #3
open(INF3,"<",'/scratch/Drosophila/gene_association.goa_fly') or die $!;

# Open a filehandle to write new file
open(OUTF1,">",'FlyRNAi_data_baseline_vs_EGFSymbol.txt') or die $!;

# Open a filehandle to write new file
open(OUTF2,">",'FlyRNAi_data_baseline_vs_EGF_GO.txt') or die $!;

# Initialize a hash for the gene symbol conversion
my %geneSymbolConversion;

# Read input file 1 line by line
while (<INF1>){

# Get rid of whitespace
        chomp;

# Split the line
        my @inf1Array = split("\t", $_);

# Filter entries starting with FBgn
        if ($inf1Array[0] =~ /(^FBgn\d+)/){

# Assign column 1 to hash key scalar
        my $geneID = $inf1Array[0];

# Assign column 2 to hash value scalar
        my $geneSymbol = $inf1Array[1];

# Assign key and value to hash
        $geneSymbolConversion{$geneID} = $geneSymbol;

}

}

# Discard first line of input file 2
<INF2>;

# Read input file 2 line by line
while (<INF2>){


        # Get rid of whitespace
        chomp;

        # Split the line on tabs
        my ($geneID, $egf_Baseline, $egf_Stimulus) = split("\t", $_);

        # Check if the codon is present in the hash
        if (defined $geneSymbolConversion{$geneID}){

                # Get the value associated with the codon from the hash
                $geneSymbol = $geneSymbolConversion{$geneID};
        }

        # Join data and print to output file
        print OUTF1 join( "\t", $geneSymbol, $egf_Baseline, $egf_Stimulus), "\n";
}

# Initialize hash for GO conversion
my %geneSymbolToGo;

<INF3>;

# Read input file 3 line by line
while (<INF3>){

        # Get rid of whitespace
        chomp;

        # Discard comment lines
        if ($_ !~ /!/){

        # Split the line on tabs
        my @inf3Array = split("\t", $_);

        # Assign column 3 to hash key scalar
        my $geneSymbol = $inf3Array[2];

        # Assign column 4 to hash value scalar
        my $go = $inf3Array[3];

        # Assign key and value to hash
        my $geneSymbolToGo{$geneSymbol} = $go;
        }
}

# Open a filehandle to read file #3
open(INF4,"<",'FLYRNAi_data_baseline_vs_EGFSymbol.txt') or die $!;

# Read input file 4 line by line
while (<INF4>){

        # Remove end of line characters
        chomp;

        # Split the line on tabs
        my ($geneSymbol, $egf_Baseline, $egf_Stimulus), "\n";

        # Check if the gene symbol is present in the hash
        if (defined $geneSymbolToGo{$geneSymbol}){

                # Get the value associated with the codon from the hash
                $go = $geneSymbolToGo{$geneSymbol};

        }

        # Join data and print to output file
        print OUTF2 join( "\t", $go, $egf_Baseline, $egf_Stimulus), "\n";
}
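
A note on the syntax error quoted in the update: in Perl, my declares a whole variable (for example my %geneSymbolToGo;), so it cannot prefix an assignment to a single hash element. The line my $geneSymbolToGo{$geneSymbol} = $go; therefore does not compile, and dropping the my on that line is likely the fix. A minimal sketch, assuming made-up sample rows (the data and the snake_case names below are illustrative, not taken from the real file or the original script):

```perl
use strict;
use warnings;

# Hypothetical sample rows standing in for gene_association.goa_fly
my @lines = (
    "! a comment line",
    "UniProtKB\tA0A021WW37\tCG17167\tGO:0016021\tGO_REF:0000038",
);

# The hash is declared once, with my
my %gene_symbol_to_go;

for my $line (@lines) {

    # Skip comment lines, i.e. lines that start with "!"
    next if $line =~ /^!/;

    my @fields = split /\t/, $line;

    # No "my" here: this assigns to one element of the existing hash
    $gene_symbol_to_go{ $fields[2] } = $fields[3];
}

print "$_\t$gene_symbol_to_go{$_}\n" for sort keys %gene_symbol_to_go;
```

The same rule explains the trouble with use strict: variables that are only assigned inside an if block, such as $geneSymbol and $go in the loops above, need a my declaration before the if so that they are still in scope at the print statement.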

1 answer:

Answer 0 (score: 1)

  • Always

    use strict;
    use warnings 'all';
    

    at the top of every Perl program. use diagnostics is of little use unless you cannot understand the error messages from those two

  • If you are doing many disk operations, use autodie saves you from writing sensible code to catch errors after every operation, such as or die $!

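  • Always use lexical file handles. For example

    open my $inf1_fh, '<', '/scratch/Drosophila/fb_synonym_fb_2014_05.tsv'
    

    and name them better. Your code goes to two extremes: it uses the overly verbose geneSymbolConversion for the essential data, but INF1, INF2, etc. for the file handles. I don't know your application, but I'm sure it wouldn't be hard to think of something that reflects the purpose of the file and add _fh to indicate that it's a file handle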

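  • You are likely to run into problems if you use identifiers that start with a capital letter for local variables, since by convention capitalized names are reserved for global items such as package names. People familiar with Perl will also thank you for avoiding capital letters in names and using snake case, so %geneSymbolConversion is better written as %gene_symbol_conversion

    Your identifiers are also too long. This hash's name could be shortened further to %conversion without ambiguity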

  • The first parameter of split is a regular expression, and the second parameter defaults to $_, so

    split("\t", $_)

    is better written as

    split /\t/
    
  • Your regular expression /(^FBgn\d+)/ captures the matched string but never uses the capture, so you should write just /^FBgn\d+/

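  • I don't understand what you are doing in the while loop

    while ( $INF1Array[0] =~ /(^FBgn\d+)/ ) { ... }

    because $INF1Array[0] (which should be $inf1_array[0]) never changes in the loop body, so the loop never terminates. My guess is that while should be if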

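  • Use Perl's defined-or operator. Instead of

    my $geneSymbol = "NA";
    
    if ( defined $geneSymbolConversion{$geneID} ) {
        $geneSymbol = $geneSymbolConversion{$geneID};
    }

    you should write

    my $gene_symbol = $conversion{$gene_id} // 'NA'

This is how I would write something more Perlish and usable. It is far from a complex program, so I don't think it needs any comments at all. The vertical space they take up is a more obvious obstacle than anything they make up for in explanation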
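
As a rough illustration of the points in this answer (a sketch under stated assumptions, not the answerer's own program), the first two steps of the script might look like the following. The in-memory strings $synonym_data and $rnai_data are hypothetical stand-ins for the two input files so that the example runs on its own, and the handle names $synonym_fh and $rnai_fh are assumed, chosen to reflect each file's purpose.

```perl
use strict;
use warnings 'all';
use autodie;

# Hypothetical in-memory stand-ins for the synonym and RNAi data files
my $synonym_data = join "\n",
    "## a comment line",
    "FBgn0000001\tgeneA",
    "FBgn0000002\tgeneB",
    "";
my $rnai_data = join "\n",
    "gene_id\tbaseline\tstimulus",
    "FBgn0000001\t1.0\t2.5",
    "FBgn0009999\t0.3\t0.4",
    "";

# Lexical file handle; autodie raises an exception if open fails
open my $synonym_fh, '<', \$synonym_data;

my %conversion;
while (<$synonym_fh>) {
    chomp;

    # split takes a regex and defaults to splitting $_
    my ($gene_id, $gene_symbol) = split /\t/;

    # Keep only rows with a symbol whose ID starts with FBgn (no capture needed)
    next unless defined $gene_symbol and $gene_id =~ /^FBgn\d+/;

    $conversion{$gene_id} = $gene_symbol;
}
close $synonym_fh;

open my $rnai_fh, '<', \$rnai_data;

# Read and discard the header line
my $header = <$rnai_fh>;

my @output;
while (<$rnai_fh>) {
    chomp;
    my ($gene_id, $baseline, $stimulus) = split /\t/;

    # Defined-or: fall back to 'NA' when the ID has no known symbol
    my $gene_symbol = $conversion{$gene_id} // 'NA';

    push @output, join "\t", $gene_symbol, $baseline, $stimulus;
}
close $rnai_fh;

print "$_\n" for @output;
```

To run this against the real data, the two scalar references would be replaced by the file paths from the question; the rest of the logic stays the same.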