Question

I have a database dump file. The field terminator is "\t|\t". I am trying to get the first two fields (tax_id and parent tax_id) with this code:

Code:

while(my $line = <INPUT>) {   
    my ($taxid, $parentid, $rank, $embl, $div, $inherdiv, $mito, $inhermito, $gbflag, $subtree, $comment)  = split (/\|/, $line);
    $taxid =~ s/^\t$//g;  
    $parentid =~ s/^\t$//g;
    print $taxid."_".$parentid."\n";
}

Sample output:

69223   _       204037
69224   _       551

When I use the substitution s///g, it does not seem to clean up the tab delimiters. Any ideas? Is there a better way to clean up each value in the fields?

Answer 1

Rather than trying to parse this by hand, I would try using Text::CSV.

use Text::CSV;

my $csv = Text::CSV->new({
    binary => 1,            # just always do this
    eol => "\n",            # end of line char
    sep_char => "|",        # separator
    allow_whitespace => 1   # Auto trim tabs and spaces when parsing
});

open my $fh, '<', $path_to_db_dump
    or die "Can't open $path_to_db_dump - $!\n";

my @headers = qw/
    taxid   parentid
    rank    embl
    div     inherdiv
    mito    inhermito
    gbflag  subtree
    comment
/;
$csv->column_names( @headers );

# skip to the place in the file where data lines live

while ( my $row = $csv->getline_hr($fh) ) {

    print "$row->{taxid}_$row->{parentid}\n";

}

If you provide a sample of the raw data, this code could be made more specific.

Answer 2

Split on the full delimiter rather than just part of it:

my ($taxid, $parentid, $rank, $embl, $div, $inherdiv, $mito, $inhermito, $gbflag, $subtree, $comment)
    = split "\t\\|\t", $line;

Then there is no need to clean up your data afterwards.
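
As a minimal sketch of this approach, the snippet below splits one made-up line on the full "\t|\t" delimiter. The sample line is an assumption (only the two ids come from the question's output), and so is the guess that each line ends with a trailing "\t|", which is stripped before splitting.

```perl
use strict;
use warnings;

# Hypothetical sample line; only the two ids are taken from the question's output.
my $line = "69223\t|\t204037\t|\tspecies\t|\n";

chomp $line;
# Assumption: each line ends with a trailing "\t|" terminator; remove it if present.
$line =~ s/\t\|$//;

# Split on the complete delimiter: tab, literal pipe, tab.
my ($taxid, $parentid, @rest) = split /\t\|\t/, $line;

print $taxid . "_" . $parentid . "\n";
```

Because the tabs are consumed as part of the delimiter, none of the resulting fields needs a follow-up substitution.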

Answer 3

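If you stay with your current solution, drop the "^" metacharacter, which means "starts with". The "$" anchor should go too: together the two anchors make the pattern match only a field that consists of nothing but a single tab.

You want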

$taxid =~ s/\t//g;

Example

my $str = "|\tHi\t|";
print "$str\n";
$str=~ s/\t//g;
print "$str\n";

Output:

|   Hi  |
|Hi|
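
To see the difference on a value shaped like the ones in the question, here is a small sketch comparing the anchored and unanchored patterns. The field value is an assumption about what splitting on "|" alone leaves behind (the id followed by a tab).

```perl
use strict;
use warnings;

# A field as produced by splitting on "|" alone: the value plus a trailing tab.
my $field = "69223\t";

# Anchored pattern from the question: matches only if the whole field is one tab.
(my $anchored = $field) =~ s/^\t$//g;

# Unanchored pattern: removes every tab wherever it occurs.
(my $unanchored = $field) =~ s/\t//g;

print "anchored:   [", $anchored, "]\n";   # the tab is still there
print "unanchored: [", $unanchored, "]\n";
```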

Answer 4

Try matching general whitespace, trimming it from both ends of each field:

$taxid =~ s/^\s+|\s+$//g;
$parentid =~ s/^\s+|\s+$//g;
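
To apply whitespace trimming to every field at once rather than one variable at a time, a sketch might look like the following. The sample line is made up (only the two ids come from the question's output), and the pattern used here removes leading and trailing whitespace from each piece.

```perl
use strict;
use warnings;

# Hypothetical sample line; only the two ids come from the question's output.
my $line = "69223\t|\t204037\t|\tspecies\t|\n";

# Split on the pipe alone, then trim leading and trailing whitespace from each piece.
my @fields = map { my $f = $_; $f =~ s/^\s+|\s+$//g; $f } split /\|/, $line;

print join("_", @fields[0, 1]), "\n";
```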

Parsing a database dump file in Perl

4 answers:
