我是一个perl初学者,请帮我解决我的问题......我正在尝试从一个爆炸表中提取信息(下面是一个如下所示的片段): 这是一个标准的爆炸表输入......我基本上想要在读取列表中提取任何信息(查看我下面的第二个脚本,以了解我想要做什么)....无论如何这正是我的意思已经完成了第二个脚本:
输入:
1)爆破表:
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
34.3 7.5 123828 GH8NFLV01A03QJ GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1 XP_002639994 Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
37.7 0.98 33114 GH8NFLV01A0H5C GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306
103 1e-020 65742 GH8NFLV01A0MXI GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475 1 ABZ08973 hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68
41.6 0.053 36083 GH8NFLV01A0QKX GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1 XP_766153 hypothetical protein [Theileria parva strain Muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60
45.4 0.003 78246 GH8NFLV01A0Z29 GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1 ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150
71.6 4e-011 97250 GH8NFLV01A14MR GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2)阅读清单:
GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc
3)我想要的输出:
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
我想要第一个列表中的信息子集,给出我想要提取的读取名称列表(可在第4列中找到) 我没有对读取列表进行散列(仅?),而是想对爆破表本身进行散列,并使用第4列(爆破表)中的信息作为提取每个键值的键,即使该键可能有更多比一个值(即:每个读取的名称实际上可能有多个匹配,或表中关联的blast结果),请记住,该值包括其中包含该键(readname)的WHOLE行。
我的greplist.pl脚本执行此操作,但我认为非常慢,(如果我错了,请纠正我)通过将整个表加载到哈希中,这应该会极大地加速...
感谢您的帮助。
我的脚本: 破碎的那个(mambo5.pl)
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "bingo!";
my $output =$hash{$readName};
print OUT "$output\n";
}
else
{
print "it aint workin\n";
#print %hash;
}
}
}
close (LIST);
缓慢而快速的作弊(有效)并且非常慢(我的爆炸表可以大约400MB到2GB,我相信你可以看到它为什么这么慢)
#!/usr/bin/perl -w
##
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script
use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while($line=<READSLIST>)
{
if ($line=~/^(.*)$/)
{
$read = $1;
print "$read\n";
system("grep \"$read\" $db >$read\_.out\n");
}
#system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");
我不知道如何将第4列定义为键:也许我可以使用split函数,但我试图找到一种方法来为超过2列的表做到这一点无济于事。 .. 请帮忙! 如果有一个简单的方法,请告诉我
谢谢!
答案 0 :(得分:4)
我会做相反的事情,即将读取列表文件读入散列,然后通过大爆炸文件并打印所需的行。
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# Read the readslist file into a hash
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading:$!";
my %readslist = map { chomp; $_ => 1 }<$fh>;
close $fh;
open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading:$!";
# loop on all the blastfile lines
while (<$fh_blast>) {
chomp;
# retrieve the key (4th column)
my ($key) = (split/\s+/)[3];
# print the line if the key exists in the hash
say $_ if exists $readslist{$key};
}
close $fh_blast;
答案 1 :(得分:0)
我建议您构建一个索引,将 blasts 文件临时转换为索引顺序文件。通读它并在文件中构建一个地址散列,其中每个键的每个记录都会开始。
之后只需要seek
文件中的正确位置来获取所需的记录。这肯定比大多数简单的解决方案更快,因为它只需要读取一次大文件。此示例代码演示了。
use strict;
use warnings;
use Fcntl qw/SEEK_SET/;
my %index;
open my $blast, '<', 'blast.txt' or die $!;
until (eof $blast) {
my $place = tell $blast;
my $line = <$blast>;
my $key = (split ' ', $line, 5)[3];
push @{$index{$key}}, $place;
}
open my $reads, '<', 'reads.txt' or die $!;
while (<$reads>) {
next unless my ($key) = /(\S+)/;
next unless my $places = $index{$key};
foreach my $place (@$places) {
seek $blast, $place, SEEK_SET;
my $line = <$blast>;
print $line;
}
}
答案 2 :(得分:0)
Voila,有两种做法,一种与perl无关:
awk 'BEGIN {while ( i = getline < "reads_list") ar[$i] = $1;} {if ($4 in ar) print $0;}' blast_table > new_blast_table
Mambo6.pl
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames. HINT: Make sure your list file only has unique names , that way you save time.
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash;
my $val;
my $key;
while (<DATA>)
{
#chomp;
if(/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
{
#print "$1\n";
$key= $5;#read
$val= $1;#whole row; notice the brackets around the whole match.
$hash{$key} .= exists $hash{$key} ? "$val\n" : $val;
}
else {
print "something wrong with format";
}
}
close (DATA);
open(OUT, "> $ARGV[1]\_out\.txt");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "$readName\n";
my $output =$hash{$readName};
print OUT "$output";
}
else
{
#print "it aint workin\n";
}
}
}
close (LIST);
close (OUT);
oneliner更快,可能比我的脚本更好,我相信有些人可以找到更简单的方法来做到这一点......我只是觉得我已经把它放了,因为它做了我想要的。