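Script to find lines in two files where fields 1 and 5 match, and print a combination of fields from both files

Time: 2015-08-18 18:14:50

Tags: perl awk

I want to find the lines where fields 1 and 5 of file 1 match fields 1 and 5 of file 2 and, for those lines, print fields 3, 6, 5 and 7 from file 1 and fields 7, 8 and 9 from file 2.

File 1: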

 5     49841950  rs201370260      5     49841950  rs201370260            1 
 5     49841950  rs201370260      5     49841652   rs75811775     0.983883 
 5     49841950  rs201370260      5     49694713  rs200980145     0.899981 
 5     49841950  rs201370260      5     49694713    rs1052977     0.894315 

File 2:

5 5_49841950_D I2 D 49841950 0.882 1.05876 0.0112 3.69E-7 0
5 rs28680688 C G 12114 0.842 0.98738 0.0131 0.3326 0
5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0

I have a script that does this:

#! perl -w                                                                                                              
use strict;
use warnings;

my @loci;
open( my $loci_in, "<", "File 2" ) or die $!;
while (<$loci_in>) {
    my ( $chr, $rsID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt ) = split;
    next if m/hg19chrc/;
    push @loci, [$chr, $rsID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt];
}
close $loci_in;

my $filename = shift @ARGV;
open( my $input, "<", "File 1") or die $!;
print "rsID1 rsID2 bp2 r2 or se p\n";
while (<$input>) {
    next if m/chr/;
    my ( $chr1, $bp1, $rsID1, $chr2, $bp2, $rsID2, $r2 ) = split;
    foreach my $locus (@loci) {
        if (    $chr2 =~ /^$locus->[0]$/
                and $bp2 =~ /^$locus->[4]$/)                                                                            
        {
            print "$rsID1 $rsID2 $bp2 $r2 $locus->[6] $locus->[7] $locus->[8]\n";
            next;
        }
    }
}
close($input);

I run into a problem when there are multiple matches on fields 1 and 5. For example, the file 1 entries

 5     49841950  rs201370260      5     49694713  rs200980145     0.899981 
 5     49841950  rs201370260      5     49694713    rs1052977     0.894315 

match two file 2 entries:

5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0

So the output has 4 lines when it should have only two:

rs201370260 rs200980145 49694713 0.899981 1.05306 0.0117 9.224E-6
rs201370260 rs200980145 49694713 0.899981 1.05043 0.0107 4.477E-6
rs201370260 rs1052977 49694713 0.894315 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 49694713 0.894315 1.05043 0.0107 4.477E-6

The desired output would be:

rs201370260 rs200980145 49694713 0.899981 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 49694713 0.894315 1.05043 0.0107 4.477E-6

Does anyone have a perl or awk solution?

3 answers:

Answer 0: (score: 0)

From your description, all you need is:

$ cat tst.awk
NR==FNR { file1[$1,$5] = $3 OFS $6 OFS $7; next }
($1,$5) in file1 { print file1[$1,$5], $7, $8, $9 }

$ awk -f tst.awk file1 file2
rs201370260 rs201370260 1 1.05876 0.0112 3.69E-7
rs201370260 rs1052977 0.894315 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 0.894315 1.05043 0.0107 4.477E-6

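But I don't know why your expected output is the 2 lines you show in the question rather than the 3 lines above. You need to edit your question to tell us how you want duplicate keys in file1 and file2 to be handled.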
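
One possible way to resolve duplicate keys, sketched here as an assumption rather than anything the asker confirmed: pair the n-th file1 line for a given (field 1, field 5) key with the n-th file2 line for the same key. The function name `pair_in_order` and the temporary file names are made up for this illustration; the sample rows are the ones from the question.

```shell
# Sample data copied from the question; the temp directory is only for the demo.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
 5     49841950  rs201370260      5     49841950  rs201370260            1
 5     49841950  rs201370260      5     49841652   rs75811775     0.983883
 5     49841950  rs201370260      5     49694713  rs200980145     0.899981
 5     49841950  rs201370260      5     49694713    rs1052977     0.894315
EOF
cat > "$tmp/file2" <<'EOF'
5 5_49841950_D I2 D 49841950 0.882 1.05876 0.0112 3.69E-7 0
5 rs28680688 C G 12114 0.842 0.98738 0.0131 0.3326 0
5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0
EOF

# pair_in_order FILE1 FILE2 (hypothetical helper)
# Assumption: the n-th FILE1 line with a given key belongs to the
# n-th FILE2 line with the same key.
pair_in_order() {
    awk '
        # First input (file2): remember fields 7-9 per key and occurrence number.
        NR == FNR {
            n = ++seen2[$1, $5]
            stats[$1, $5, n] = $7 OFS $8 OFS $9
            next
        }
        # Second input (file1): look up the same occurrence number for this key.
        {
            n = ++seen1[$1, $5]
            if (($1, $5, n) in stats)
                print $3, $6, $5, $7, stats[$1, $5, n]
        }
    ' "$2" "$1"
}

pair_in_order "$tmp/file1" "$tmp/file2"
```

On the sample data this prints the 49841950 line plus the two lines from the desired output, pairing rs200980145 with the 1.05306 record and rs1052977 with the 1.05043 record. If the two files do not list duplicates in a compatible order, occurrence-based pairing will join the wrong records, so an explicit tie-breaking field would be safer.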

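Answer 1: (score: 0)

Does the following do what you want? It's not very elegant, as I was just playing around. Note that arrays start at 0, so each "field" number you specified is reduced by 1.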

use warnings;
use strict;

open my $fh1, '<', 'f1.txt' or die $!;
open my $fh2, '<', 'f2.txt' or die $!;

my @fh2 = <$fh2>;

while (<$fh1>){

    my $fh2_elem = shift @fh2;

    my $same = 0;

    for my $col (qw(0 4)){
        if(((split)[$col]) eq ((split /\s+/, $fh2_elem)[$col])){
            $same = 1;
        }
        last if ! $same;
    }

    if ($same){
       print( (join ' ', (split)[2,5,4,6], (split /\s+/, $fh2_elem)[6,7,8]) );
       print "\n";
    }

}

Output:

rs201370260 rs201370260 49841950 1 1.05876 0.0112 3.69E-7
rs201370260 rs75811775 49841652 0.983883 0.98738 0.0131 0.3326
rs201370260 rs200980145 49694713 0.899981 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 49694713 0.894315 1.05043 0.0107 4.477E-6
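
A stricter form of the same line-by-line comparison can be sketched with paste and awk, requiring both columns to agree before a line is printed. It assumes, as the code above does, that line N of one file corresponds to line N of the other, and that every file 1 line has exactly 7 fields. The name `same_line_match` and the temporary file names are hypothetical.

```shell
# Sample data copied from the question; the temp directory is only for the demo.
tmp=$(mktemp -d)
cat > "$tmp/f1.txt" <<'EOF'
 5     49841950  rs201370260      5     49841950  rs201370260            1
 5     49841950  rs201370260      5     49841652   rs75811775     0.983883
 5     49841950  rs201370260      5     49694713  rs200980145     0.899981
 5     49841950  rs201370260      5     49694713    rs1052977     0.894315
EOF
cat > "$tmp/f2.txt" <<'EOF'
5 5_49841950_D I2 D 49841950 0.882 1.05876 0.0112 3.69E-7 0
5 rs28680688 C G 12114 0.842 0.98738 0.0131 0.3326 0
5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0
EOF

# same_line_match FILE1 FILE2 (hypothetical helper)
# paste glues line N of FILE1 to line N of FILE2. FILE1 contributes
# fields 1-7, so FILE2's field k becomes field 7+k of the joined line.
same_line_match() {
    paste "$1" "$2" | awk '
        # FILE2 field 1 is $8 and FILE2 field 5 is $12; both must match.
        $1 == $8 && $5 == $12 { print $3, $6, $5, $7, $14, $15, $16 }
    '
}

same_line_match "$tmp/f1.txt" "$tmp/f2.txt"
```

With the sample files this should print three lines and skip the rs75811775 line, whose position 49841652 differs from the 12114 on the corresponding file 2 line.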

Answer 2: (score: 0)

Just use a hash to filter out the duplicates from the file. Build a hash keyed on the 1st and 5th fields and check it before pushing the values onto the @loci array.

Try this:

#!/usr/bin/perl -w
use strict;
use warnings;

my @loci;
my %hash;
open( my $loci_in, "<", "file2" ) or die $!;
while (<$loci_in>) {
    my ( $chr, $rsID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt ) = split;
    next if m/hg19chrc/;
    if ( !$hash{$chr}{$bp} ) {    # check if previously the same record exists
        push @loci, [ $chr, $rsID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt ];
        $hash{$chr}{$bp} = 1;     # otherwise set the record
    }
}
close $loci_in;

#my $filename = shift @ARGV;
open( my $input, "<", "file1" ) or die $!;

#print "rsID1 rsID2 bp2 r2 or se p\n";
while (<$input>) {
    next if m/chr/;
    my ( $chr1, $bp1, $rsID1, $chr2, $bp2, $rsID2, $r2 ) = split;
    foreach my $locus (@loci) {
        if (    $chr2 =~ /^$locus->[0]$/
            and $bp2 =~ /^$locus->[4]$/ )
        {
            print "$rsID1 $rsID2 $bp2 $r2 $locus->[6] $locus->[7] $locus->[8]\n";
            next;
        }
    }
}
close($input);

Output:

rs201370260 rs201370260 49841950 1 1.05876 0.0112 3.69E-7
rs201370260 rs200980145 49694713 0.899981 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 49694713 0.894315 1.05306 0.0117 9.224E-6
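
The same first-record-wins rule can also be sketched in awk with a direct hash lookup, which avoids looping over every stored locus for each input line. This is only an illustration: `first_wins` and the temporary file names are hypothetical, and it keys on fields 1 and 5 of both files as the question describes (the Perl above compares field 4 of file 1 instead, which holds the same value in the sample data).

```shell
# Sample data copied from the question; the temp directory is only for the demo.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
 5     49841950  rs201370260      5     49841950  rs201370260            1
 5     49841950  rs201370260      5     49841652   rs75811775     0.983883
 5     49841950  rs201370260      5     49694713  rs200980145     0.899981
 5     49841950  rs201370260      5     49694713    rs1052977     0.894315
EOF
cat > "$tmp/file2" <<'EOF'
5 5_49841950_D I2 D 49841950 0.882 1.05876 0.0112 3.69E-7 0
5 rs28680688 C G 12114 0.842 0.98738 0.0131 0.3326 0
5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0
EOF

# first_wins FILE1 FILE2 (hypothetical helper)
# Keeps only the first FILE2 record seen for each (field 1, field 5) key.
first_wins() {
    awk '
        # First input (file2): store fields 7-9 unless the key is already known.
        NR == FNR {
            if (!(($1, $5) in first))
                first[$1, $5] = $7 OFS $8 OFS $9
            next
        }
        # Second input (file1): one hash lookup per line.
        ($1, $5) in first { print $3, $6, $5, $7, first[$1, $5] }
    ' "$2" "$1"
}

first_wins "$tmp/file1" "$tmp/file2"
```

On the sample data this should give the same three lines as the Perl version. Note that both file 1 lines at position 49694713 then receive the 1.05306 record, which is not the pairing shown in the question's desired output.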