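I want to find the lines where fields 1 and 5 of file 1 match fields 1 and 5 of file 2 and, for those lines, print fields 3, 6, 5 and 7 from file 1 and fields 7, 8 and 9 from file 2.
File 1: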
5 49841950 rs201370260 5 49841950 rs201370260 1
5 49841950 rs201370260 5 49841652 rs75811775 0.983883
5 49841950 rs201370260 5 49694713 rs200980145 0.899981
5 49841950 rs201370260 5 49694713 rs1052977 0.894315
File 2:
5 5_49841950_D I2 D 49841950 0.882 1.05876 0.0112 3.69E-7 0
5 rs28680688 C G 12114 0.842 0.98738 0.0131 0.3326 0
5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0
I have a script to do this:
#! perl -w
use strict;
use warnings;
my @loci;
open( my $loci_in, "<", "File 2" ) or die $!;
while (<$loci_in>) {
    my ( $chr, $rsID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt ) = split;
    next if m/hg19chrc/;
    push @loci, [$chr, $rsID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt];
}
close $loci_in;
my $filename = shift @ARGV;
open( my $input, "<", "File 1") or die $!;
print "rsID1 rsID2 bp2 r2 or se p\n";
while (<$input>) {
    next if m/chr/;
    my ( $chr1, $bp1, $rsID1, $chr2, $bp2, $rsID2, $r2 ) = split;
    foreach my $locus (@loci) {
        if ( $chr2 =~ /^$locus->[0]$/
            and $bp2 =~ /^$locus->[4]$/)
        {
            print "$rsID1 $rsID2 $bp2 $r2 $locus->[6] $locus->[7] $locus->[8]\n";
            next;
        }
    }
}
close($input);
I run into a problem when there are multiple matches on fields 1 and 5. For example, the file 1 entries
5 49841950 rs201370260 5 49694713 rs200980145 0.899981
5 49841950 rs201370260 5 49694713 rs1052977 0.894315
match two file 2 entries:
5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0
So the output has 4 lines when it should have only two:
rs201370260 rs200980145 49694713 0.899981 1.05306 0.0117 9.224E-6
rs201370260 rs200980145 49694713 0.899981 1.05043 0.0107 4.477E-6
rs201370260 rs1052977 49694713 0.894315 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 49694713 0.894315 1.05043 0.0107 4.477E-6
The desired output would be:
rs201370260 rs200980145 49694713 0.899981 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 49694713 0.894315 1.05043 0.0107 4.477E-6
Does anyone have a Perl or awk solution?
Answer 0 (score: 0)
Based on your description, all you need is:
$ cat tst.awk
NR==FNR { file1[$1,$5] = $3 OFS $6 OFS $7; next }
($1,$5) in file1 { print file1[$1,$5], $7, $8, $9 }
$ awk -f tst.awk file1 file2
rs201370260 rs201370260 1 1.05876 0.0112 3.69E-7
rs201370260 rs1052977 0.894315 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 0.894315 1.05043 0.0107 4.477E-6
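But I don't know why your expected output is the 2 lines shown in your question rather than the 3 lines above. You need to edit your question to tell us how you want duplicate keys in file1 and file2 to be handled.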
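One possible reading of the desired output is that repeated keys should pair up in order of appearance: the first file 1 line with a given key takes the first file 2 record with that key, the second takes the second, and so on. The question does not state this rule, so treat the following as a minimal sketch under that assumption. The array names `n`, `rec` and `used` are illustrative, and the key for file 1 is taken from fields 4 and 5 to mirror the `$chr2`/`$bp2` comparison in the question's script.

```shell
#!/bin/sh
# Sample data copied from the question, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
5 49841950 rs201370260 5 49841950 rs201370260 1
5 49841950 rs201370260 5 49841652 rs75811775 0.983883
5 49841950 rs201370260 5 49694713 rs200980145 0.899981
5 49841950 rs201370260 5 49694713 rs1052977 0.894315
EOF
cat > "$tmp/file2" <<'EOF'
5 5_49841950_D I2 D 49841950 0.882 1.05876 0.0112 3.69E-7 0
5 rs28680688 C G 12114 0.842 0.98738 0.0131 0.3326 0
5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0
EOF

out=$(awk '
# Pass 1 (file2): store fields 7-9 under (chr, bp), numbered by occurrence.
NR == FNR {
    k = $1 SUBSEP $5
    n[k]++
    rec[k, n[k]] = $7 OFS $8 OFS $9
    next
}
# Pass 2 (file1): the i-th line with a given (chr2, bp2) takes the i-th record.
{
    k = $4 SUBSEP $5
    if (!(k in n)) next          # no file2 record for this key
    i = ++used[k]
    if (i > n[k]) i = n[k]       # assumption: reuse the last record if file1 has extra lines
    print $3, $6, $5, $7, rec[k, i]
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

On the sample data this prints the two lines the question asks for, preceded by the line for position 49841950, which also matches a file 2 record. If file 1 can contain more lines per key than file 2, the fallback on `i > n[k]` is the part to change.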
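Answer 1 (score: 0)
Does the following do what you want? It isn't very elegant, since I was just playing around. Note that arrays start at 0, so each "field" number you specified is reduced by 1.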
use warnings;
use strict;
open my $fh1, '<', 'f1.txt' or die $!;
open my $fh2, '<', 'f2.txt' or die $!;
my @fh2 = <$fh2>;
while (<$fh1>){
    my $fh2_elem = shift @fh2;
    last unless defined $fh2_elem;   # file 2 ran out of lines
    my @f1 = split;
    my @f2 = split ' ', $fh2_elem;
    # the pair of lines matches only if both key columns (fields 1 and 5) are equal
    my $same = 1;
    for my $col (0, 4){
        if ($f1[$col] ne $f2[$col]){
            $same = 0;
            last;
        }
    }
    if ($same){
        print join(' ', @f1[2,5,4,6], @f2[6,7,8]), "\n";
    }
}
Output:
rs201370260 rs201370260 49841950 1 1.05876 0.0112 3.69E-7
rs201370260 rs200980145 49694713 0.899981 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 49694713 0.894315 1.05043 0.0107 4.477E-6
Answer 2 (score: 0)
Just use a hash to filter out the duplicates from file 2. Build a hash keyed on the 1st and 5th fields and check it before pushing values into the @loci array.
Try this:
#!/usr/bin/perl -w
use strict;
use warnings;
my @loci;
my %hash;
open( my $loci_in, "<", "file2" ) or die $!;
while (<$loci_in>) {
    my ( $chr, $rsID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt ) = split;
    next if m/hg19chrc/;
    if ( !$hash{$chr}{$bp} ) {    # check if previously the same record exists
        push @loci, [ $chr, $rsID, $A1, $A2, $bp, $info, $or, $se, $p, $ngt ];
        $hash{$chr}{$bp} = 1;     # otherwise set the record
    }
}
close $loci_in;
#my $filename = shift @ARGV;
open( my $input, "<", "file1" ) or die $!;
#print "rsID1 rsID2 bp2 r2 or se p\n";
while (<$input>) {
    next if m/chr/;
    my ( $chr1, $bp1, $rsID1, $chr2, $bp2, $rsID2, $r2 ) = split;
    foreach my $locus (@loci) {
        if ( $chr2 =~ /^$locus->[0]$/
            and $bp2 =~ /^$locus->[4]$/ )
        {
            print
                "$rsID1 $rsID2 $bp2 $r2 $locus->[6] $locus->[7] $locus->[8]\n";
            next;
        }
    }
}
close($input);
Output:
rs201370260 rs201370260 49841950 1 1.05876 0.0112 3.69E-7
rs201370260 rs200980145 49694713 0.899981 1.05306 0.0117 9.224E-6
rs201370260 rs1052977 49694713 0.894315 1.05306 0.0117 9.224E-6
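The same first-record-wins rule can also be written in awk, since the question asks for either language. This is only a sketch of an equivalent, not the answer's own code: the array name `first` is illustrative, and a direct hash lookup stands in for the inner loop over @loci.

```shell
#!/bin/sh
# Sample data copied from the question, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
5 49841950 rs201370260 5 49841950 rs201370260 1
5 49841950 rs201370260 5 49841652 rs75811775 0.983883
5 49841950 rs201370260 5 49694713 rs200980145 0.899981
5 49841950 rs201370260 5 49694713 rs1052977 0.894315
EOF
cat > "$tmp/file2" <<'EOF'
5 5_49841950_D I2 D 49841950 0.882 1.05876 0.0112 3.69E-7 0
5 rs28680688 C G 12114 0.842 0.98738 0.0131 0.3326 0
5 5_49694713_I I2 D 49694713 0.864 1.05306 0.0117 9.224E-6 0
5 rs1052977 A T 49694713 0.982 1.05043 0.0107 4.477E-6 0
EOF

out=$(awk '
# Pass 1 (file2): remember fields 7-9 for the first record seen per (chr, bp).
NR == FNR {
    k = $1 SUBSEP $5
    if (!(k in first)) first[k] = $7 OFS $8 OFS $9
    next
}
# Pass 2 (file1): look up (chr2, bp2) directly instead of scanning every locus.
{
    k = $4 SUBSEP $5
    if (k in first) print $3, $6, $5, $7, first[k]
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

As with the Perl version, every file 1 line at position 49694713 is paired with the first file 2 record for that position, so the rs1052977 line carries the statistics of 5_49694713_I rather than its own.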