我正在尝试在perl中使用某种指针,以便我可以查看两个按字母顺序排序的两个文件,如果它们在第一列中具有相同的名称,则匹配两个文件中的内容。我正在搜索每个文件的方式是我正在查看第一列按字母顺序排在哪一行,然后将该文件上的指针移动到下一行。有点类似于合并排序中的指针。下面的代码是我想要的一个例子。
使用这两个文件。
SET1
apple 17 20
boombox 23 29
carl 25 29
cat 22 33
dog 27 44
SET2
ants yes
boombox no
carl yes
dentist yes
dice no
dog no
我可以创建一个类似这样的脚本
($name, $affirmation) = first line in set2; #part I'm confused about I just kind of need some sort of command of something that will do this
while (<>){
@set1 = split;
while ($name < set1[0]){
($name, $affirmation) = next line in set2; # part i'm confused about I just kind of need some sort of command of something that will do this
}
if ($name = $set[0]{
print @set1, $affirmation;
}
这是我如何运行它
./script.txt set1
我最终会以
结束boombox 23 29 no
carl 25 29 yes
dog 27 44 no
编辑:
我在一些答案中尝试了一些代码,看看我是否可以用它来制作一些功能代码,但我似乎遇到了问题,而且答案中的一些语法我无法理解所以我有找出如何调试或解决这个问题很麻烦。
这是我使用以下两个文本文件的具体示例
的text.txt
Apples 0 -1 -1 0 0 0 0 -1
Apricots 0 1 1 0 0 0 0 1
Fruit 0 -1 -1 0 0 0 0 -1
Grapes 0 -2 -1 0 0 0 0 -2
Oranges 0 1 1 0 0 0 0 -1
Peaches 0 -2 -1 0 0 0 0 -2
text2.txt
Apples CHR1 + 1167628 1170420 1 1 N
Apricots CHR1 - 2115898 2144159 1 1 N
Oranges CHR1 - 19665266 19812066 1 1 N
Noidberry CHR1 - 1337728 1329993 1 1 N
Peaches CHR1 - 1337275 1342693 1 1 N
这个脚本 script.pl
#!/usr/bin/perl
use warnings;
my $file_1 = $ARGV[0];
my $file_2 = $ARGV[1];
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
open(my $single, '>', 'text.txt');
open(my $deep, '>', 'text2.txt');
OUTER: while (my $outer = <$fh1>){
chomp $outer;
@CopyNumber = split(' ', $outer);
($title, $title2) = split('\|', $CopyNumber[0]);
#print 'title: ',$title,' title2: ',$title2,"\n";
my $numLoss = 0;
my $deepLoss = 0;
for ($i = 1; $i <= $#CopyNumber; $i++){
#print "$CopyNumber[$i], $#CopyNumber, $i, \n";
if ($CopyNumber[$i] < 0){
$numLoss = $numLoss + 1;
if ($CopyNumber[$i] <-1){
$deepLoss = $deepLoss + 1;
}
}
}
if ($GeneSym and (($GeneSym cmp $title)==0)){ #or (($GeneSym cmp $title2)==0))){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
INNER: while (my $inner = <$fh2>){
($GeneSym, $Chrom, $Strand, $Start, $Stop, $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split(' ', $inner);
$Chrom =~ s/CHR/hs/ee;
my $cmp = ($GeneSym cmp $title);
next OUTER if $cmp < 0;
if ($cmp==0){ #or (($GeneSym cmp $title2)==0)){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
}
}
如果我运行./script.pl text.txt text2.txt我应该打印到Number.txt
//对应于text2.txt的第2,4,5列,最后一列是数字小于0的列的百分比
hs1 1167628 1170420 0.375 //For Apples
hs1 2115898 2144159 0 //For Apricots
hs1 19665266 19812066 0.125 //For Oranges
hs1 1337275 1342693 0.375 //For Peaches
相反,我得到了这个
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 1337275 1342693 0.375
所以我只是得到一个错误
hs1 19665266 19812066 0.125 //For Oranges
未打印
答案 0 :(得分:3)
非常像您所述:使用cmp
进行比较,split
分为两个术语。
对于FILE1
文件的每一行,请查看FILE2
文件的行,找到匹配项后退出。一旦FILE2
按字母顺序移动到下一行FILE1
。
use warnings 'all';
use strict;
sub process {
my ($name, $affirm_1, $affirm_2) = @_;
print "$name $affirm_1 $affirm_2\n";
}
my $file_1 = 'set1.txt';
my $file_2 = 'set2.txt';
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
my ($name_2, $affirm_2);
FILE1: while (my $line1 = <$fh1>) {
chomp $line1;
my ($name_1, $affirm_1) = split ' ', $line1, 2;
if ($name_2) {
my $cmp = $name_1 cmp $name_2;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process($name_1, $affirm_1, $affirm_2);
next FILE1;
}
}
FILE2: while (my $line2 = <$fh2>) {
chomp $line2;
($name_2, $affirm_2) = split ' ', $line2, 2;
my $cmp = $name_1 cmp $name_2;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process($name_1, $affirm_1, $affirm_2);
next FILE1;
}
}
}
对一些剩余细节的评论。
一旦FILE2
行&#34;过冲,&#34;在FILE1
的下一次迭代中,我们需要首先检查该行,然后进入FILE2
循环以迭代剩余的行。对于第一个FILE1
行,$name_2
仍未取消if ($name_2)
。
针对已修改的帖子进行了更新。
use warnings 'all';
use strict;
sub process_line {
my ($single, $deep, $rline, $GeneSym, $Chrom, $Start, $Stop) = @_;
my ($numLoss, $deepLoss) = calc_loss($rline);
$Chrom =~ s/CHR/hs/;
print $single (join "\t", $Chrom, $Start, $Stop, $numLoss/$#$rline), "\n";
print $deep (join "\t", $Chrom, $Start, $Stop, $deepLoss/$#$rline), "\n";
}
sub calc_loss {
my ($rline) = @_;
my ($numLoss, $deepLoss) = (0, 0);
for my $i (1.. $#$rline) {
$numLoss += 1 if $rline->[$i] < 0;
$deepLoss += 1 if $rline->[$i] < -1;
}
return $numLoss, $deepLoss;
}
my ($Number, $NumberDeep) = ('Number.txt', 'NumberDeep.txt');
open my $single, '>', $Number or die "Can't open $Number: $!";
open my $deep, '>', $NumberDeep or die "Can't open $NumberDeep: $!";
my ($file_1, $file_2) = ('set1_new.txt', 'set2_new.txt');
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
my ($GeneSym, $Chrom, $Strand, $Start, $Stop,
$MapId, $TotalMap, $AbnormalMerge, $Overlap);
FILE1: while (my $line1 = <$fh1>) {
next if $line1 =~ /^\s*$/;
chomp $line1;
my @line = split ' ', $line1;
if ($GeneSym) {
my $cmp = $line[0] cmp $GeneSym;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process_line($single, $deep, \@line,
$GeneSym, $Chrom, $Start, $Stop);
next FILE1;
}
}
FILE2: while (<$fh2>) {
next if /^\s*$/;
chomp;
($GeneSym, $Chrom, $Strand, $Start, $Stop,
$MapId, $TotalMap, $AbnormalMerge, $Overlap) = split;
my $cmp = $line[0] cmp $GeneSym;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process_line($single, $deep, \@line,
$GeneSym, $Chrom, $Start, $Stop);
next FILE1;
}
}
}
这会使用给定的示例文件生成所需的输出。我们会采取一些快捷方式,如果评论有用,请告诉我。这里有几个
应该添加很多错误检查。
我假设FILE1
的第一个字段按原样使用。否则需要进行更改。
处理分为两个功能,计算是分开的。这不是必需的。
$#$rline
是$rline
arrayref的最后一个元素的索引。如果使用@$rline - 1
语法过多,例如(0..@$rline-1)
对问题中发布的代码的一些评论:
始终始终,请use warnings;
(和use strict;
)
循环索引最好写成foreach my $i (0..$#array)
正则表达式修饰符/ee
非常复杂。这里绝对没有必要。
答案 1 :(得分:2)
你是对的。它与合并排序完全相同,只是输出匹配的行。
sub read_and_parse1 {
my ($fh) = @_;
defined( my $line = <$fh> )
or return undef;
my ($id, @copy) = split(' ', $line); # Use split(/\t/, $line) if tab-separated data
my ($gene_sym) = split(/\|/, $id);
return [ $gene_sym, @copy ];
}
sub read_and_parse2 {
my ($fh) = @_;
defined( my $line = <$fh> )
or return undef;
return [ split(' ', $line) ]; # Use split(/\t/, $line) if tab-separated data
}
my $fields1 = read_and_parse1($fh1);
my $fields2 = read_and_parse2($fh2);
while ($fields1 && $fields2) {
my $cmp = $fields1->[0] cmp $fields2->[0];
if ($cmp < 0) { $fields1 = read_and_parse1($fh1); }
elsif ($cmp > 0) { $fields2 = read_and_parse2($fh2); }
else {
my ($gene_sym, @copy) = @$fields1;
my (undef, $chrom, $strand, $start, $stop, $map_id, $total_map, $abnormal_merge, $overlap) = @$fields2;
$chrom =~ s/^CHR/hs/;
my $num_loss = grep { $_ < 0 } @copy;
my $deep_loss = grep { $_ < -1 } @copy;
print($single_fh join("\t", $chrom, $start, $stop, $num_loss/@copy ) . "\n");
print($deep_fh join("\t", $chrom, $start, $stop, $deep_loss/@copy ) . "\n");
$fields1 = read_and_parse1($fh1);
$fields2 = read_and_parse2($fh2);
}
}
输出:
$ cat single.txt
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 19665266 19812066 0.125
hs1 1337275 1342693 0.375
$ cat deep.txt
hs1 1167628 1170420 0
hs1 2115898 2144159 0
hs1 19665266 19812066 0
hs1 1337275 1342693 0.25