I have a tab-delimited file like this (DIVERGE in my script):
contig04730 contigK02622 0.3515
contig04733 contigK02622 0.3636
contig14757 contigK03055 0.4
I have a second tab-delimited file like this (DATA):
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap
contig04730 F GO:0005528 reproduction GO:0001113 eggs
contig14757 P GO:0123456 immune GO:0003456 cells
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding
contig14757 C GO:0000001 immune GO:00066669 more_cells
I am trying to append the second and third columns of the first file to the matching rows of the second file, so that I get (OUT):
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap contigK02622 0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs contigK02622 0.3515
contig14757 P GO:0123456 immune GO:0003456 cells contigK03055 0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055 0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells contigK03055 0.4
Here is the Perl script I am trying to use (adapted from a script I found here; I am very new to Perl):
#!/usr/bin/env/perl
use strict;
use warnings;
#open the ortholog contig list
open (DIVERGE, "$ARGV[0]") or die "Error opening the input file with contig pairs";
#hash to store contig IDs
my ($espr, $liya, $divergence) = split("\t", $_);
#read through the ortho contig list and read into memory
while(<DIVERGE>){
chomp $_; #get rid of ending whitepace
($espr, $liya, $divergence)->{$_} = 1;
}
close(DIVERGE);
#open output file
open(OUT, ">$ARGV[2]") or die "Error opening the output file";
#open data file
open(DATA, "$ARGV[1]") or die "Error opening the sequence pairs file\n";
while(<DATA>){
chomp $_;
my ($contigs, $FPC, $GOslim, $slimdesc, $GOterm, $GOdesc) = split("\t", $_);
if (defined $espr->{$contigs}) {
print OUT "$_", "\t$liya\t$divergence", "\n";
}
}
close(DATA);
close(OUT);
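But the errors I get are "Useless use of private variable in void context" at line 15 and "Use of uninitialized value $_ in split" at line 10. I only have a very basic grasp of Perl terminology and variables, so if anyone could point out where I am going wrong and how to fix it, I would be very grateful.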
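Answer 0 (score: 3)
This is an opportunity to use the Text::CSV module. The benefit of using a proper parser for CSV data is, of course, that edge cases do not corrupt your data.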
use strict;
use warnings;
use Text::CSV;
my $div = "diverge.txt"; # the names can also be taken from the command line, e.g.
my $data = "data.txt"; # my ($div, $data) = @ARGV;
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
sep_char => "\t",
});
my %div;
open my $fh, "<", $div or die $!;
while (my $row = $csv->getline($fh)) {
my $key = shift @$row; # first col is key
$div{$key} = $row; # store row entries
}
close $fh;
open $fh, "<", $data or die $!;
while (my $row = $csv->getline($fh)) {
my $key = $row->[0]; # first col is key (again)
push @$row, @{ $div{$key} || [] }; # add stored values to $row (nothing if the key is unknown)
$csv->print(*STDOUT, $row); # print using Text::CSV's method
}
Output:
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap contigK02622 0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs contigK02622 0.3515
contig14757 P GO:0123456 immune GO:0003456 cells contigK03055 0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055 0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells contigK03055 0.4
Note that the output looks different because it is tab-delimited, whereas in the question it is space-delimited.
Answer 1 (score: 2)
What I would do:
#!/usr/bin/env perl
use strict; use warnings;
open my $fh1, "<", "file1" or die $!;
open my $fh2, "<", "file2" or die $!;
my %hash;
while (<$fh1>) {
chomp;
my @F = split;
$hash{$F[0]} = join "\t", @F[1..2];
}
while (<$fh2>) {
chomp;
my @F = split;
print join("\t", $_, $hash{$F[0]}), "\n";
}
close $fh1;
close $fh2;
Output:
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap contigK02622 0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs contigK02622 0.3515
contig14757 P GO:0123456 immune GO:0003456 cells contigK03055 0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055 0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells contigK03055 0.4
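The same hash lookup can be applied to the script in the question. As far as I can tell, both warnings there come from two lines. The split on $_ runs before any line has been read, so $_ is still undefined ("Use of uninitialized value $_ in split"). And ($espr, $liya, $divergence)->{$_} = 1 is a comma expression in which only the last variable is actually dereferenced, so the others are evaluated and discarded ("Useless use of private variable in void context"). The shebang should also read #!/usr/bin/env perl, with a space rather than a slash. Below is a minimal sketch of a corrected version, not a definitive rewrite. The sub names read_diverge and annotate are my own, and, like the original "if (defined ...)" test, it assumes DATA rows without a match should be dropped.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a lookup table from an open DIVERGE filehandle:
# first column => [second column, third column].
sub read_diverge {
    my ($fh) = @_;
    my %pair_for;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip blank lines
        # split has to run inside the loop, once per line read
        my ($espr, $liya, $divergence) = split /\t/, $line;
        $pair_for{$espr} = [$liya, $divergence];
    }
    return \%pair_for;
}

# Return every DATA line whose first column is in the table,
# with the two stored columns appended. Unmatched lines are dropped.
sub annotate {
    my ($fh, $pair_for) = @_;
    my @out;
    while (my $line = <$fh>) {
        chomp $line;
        my ($contig) = split /\t/, $line;
        next unless defined $contig && exists $pair_for->{$contig};
        push @out, join("\t", $line, @{ $pair_for->{$contig} });
    }
    return @out;
}

# Same argument order as the script in the question: DIVERGE, DATA, OUT.
if (@ARGV == 3) {
    my ($diverge_file, $data_file, $out_file) = @ARGV;

    open my $div_fh, '<', $diverge_file or die "Cannot open $diverge_file: $!";
    my $pair_for = read_diverge($div_fh);
    close $div_fh;

    open my $data_fh, '<', $data_file or die "Cannot open $data_file: $!";
    open my $out_fh,  '>', $out_file  or die "Cannot open $out_file: $!";
    print {$out_fh} "$_\n" for annotate($data_fh, $pair_for);
    close $data_fh;
    close $out_fh or die "Cannot close $out_file: $!";
}
```

Saved under a name of your choice (merge.pl here is only a placeholder), it would be run as: perl merge.pl DIVERGE DATA OUT. Storing the pair under each contig ID is what lets every DATA row pick up its own values instead of whatever was read last.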
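Answer 2 (score: 2)
This (if I understand your intent correctly) can be done in one line with the join command (on Linux, at least). Note that join expects both inputs to be sorted on the join field, which is the first column by default, as they are here: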
$ cat DATA
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap
contig04730 F GO:0005528 reproduction GO:0001113 eggs
contig14757 P GO:0123456 immune GO:0003456 cells
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding
contig14757 C GO:0000001 immune GO:00066669 more_cells
$ cat DIVERGE
contig04730 contigK02622 0.3515
contig04733 contigK02622 0.3636
contig14757 contigK03055 0.4
$ join DATA DIVERGE
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap contigK02622 0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs contigK02622 0.3515
contig14757 P GO:0123456 immune GO:0003456 cells contigK03055 0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055 0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells contigK03055 0.4
Answer 3 (score: 1)
Here is another option:
use strict;
use warnings;
my $data = pop;
my %diverge = map { /(\S+)\t+(.+)/; $1 => $2 } <>;
push @ARGV, $data;
while (<>) {
chomp;
$_ .= "\t$diverge{$1}" if /(\S+)/ and $diverge{$1};
print "$_\n"; # chomp removed the newline, so restore it for every line
}
Usage: perl script.pl DIVERGE_File DATA_File [>outFile] (script.pl stands for whatever name the script is saved under)
Output on the data set:
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap contigK02622 0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs contigK02622 0.3515
contig14757 P GO:0123456 immune GO:0003456 cells contigK03055 0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055 0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells contigK03055 0.4
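The script above prints DATA lines whose contig has no DIVERGE entry without the extra columns, so those rows end up two columns short. The sample data never exercises this, because every contig in DATA also appears in DIVERGE. If a rectangular table is needed, one option is to pad such rows with a placeholder. This is only a sketch: the sub name append_or_pad, the NA placeholder, and the contig99999 ID are assumptions made for illustration, not something the question asks for.

```perl
use strict;
use warnings;

# Append the stored DIVERGE columns for the line's first field, or pad
# with a placeholder when the contig is unknown. 'NA' is an arbitrary choice.
sub append_or_pad {
    my ($line, $diverge, $placeholder) = @_;
    $placeholder = 'NA' unless defined $placeholder;
    my ($contig) = split /\t/, $line;
    my $extra = defined $contig && exists $diverge->{$contig}
        ? $diverge->{$contig}
        : join("\t", $placeholder, $placeholder);
    return join("\t", $line, $extra);
}

# %diverge maps a contig ID to its two remaining columns, already
# joined by a tab, as in the script above.
my %diverge = (contig04730 => "contigK02622\t0.3515");

# A contig present in the table, then a hypothetical one that is not.
print append_or_pad("contig04730\tF\tGO:0000228", \%diverge), "\n";
print append_or_pad("contig99999\tF\tGO:0000228", \%diverge), "\n";
```

Whether to pad, drop, or pass through unmatched rows depends on what consumes the output, so treat the placeholder as one possible policy.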