Joining fields from different files based on numeric ranges

Date: 2017-05-29 03:31:35

Tags: bash perl awk range

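I have three files that contain mostly the same kind of information, but each has one column that is unique to it. I want to combine them into a single file. What the files have in common is a column headed hs and columns headed range1 and range2. The columns that differ are the ones labelled f1c, f2c and f3c. I want to combine the files according to where the regions between range1 and range2 overlap (the hs column must also match).

The ranges describe two bars: bar1 (hs1) has 350 sections and bar2 (hs2) has 700 sections. Each value under f1c, f2c and f3c applies to a certain number of those sections on one of the bars. Where values apply to the same sections, I want them listed next to each other.

A solution in bash, awk or perl would be fine; I am just not sure how to match these up based on ranges.

Here are examples of the files

First file format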

hs  f1c range1 range2
hs1 p32 0      200
hs1 p31 200    300
hs1 p30 300    350
hs2 p32 0      300
hs2 p31 300    500
hs2 p30 500    700

Second file format

f2c       hs     range1 range2
DDX11L1   hs1    20     50
FAM41C    hs1    50     70
WASH7P    hs1    70     120
FAM138A   hs1    180    250
OR4F5     hs2    0      50
KLHL17    hs2    50     100
PLEKHN1   hs2    100    150
LOC729737 hs2    300    500 
HES4      hs2    500    600
ISG15     hs2    600    700

Third file format

hs  range1 range2 f3c
hs1 0      200    -1
hs1 200    350    -2
hs2 0      500    -1
hs2 500    700    -2

Here is an example of the desired output (if no value from file2 falls within a range, there is an n under f2c)

hs    f1c   f2c       range1 range2 f3c
hs1   p32   n         0      20     -1   // From the 1st line of file3, and the 1st line of file1
hs1   p32   DDX11L1   20     50     -1   // From the 1st line of file1, 1st line of file2 and 1st line of file3
hs1   p32   FAM41C    50     70     -1   // From the 1st line of file1, 2nd line of file2 and 1st line of file3
hs1   p32   WASH7P    70     120    -1   // 1st line file1, 3rd line file2, 1st line file3
hs1   p32   n         120    180    -1   // 1st line file1, 1st line file3
hs1   p32   FAM138A   180    200    -1   // 1st line file1, 4th line file2, 1st line file3
hs1   p31   FAM138A   200    250    -2   // 2nd line file1, 4th line file2, 2nd line file3
hs1   p31   n         250    300    -2   // 2nd line file1, 2nd line file3
hs1   p30   n         300    350    -2   // 3rd line file1, 2nd line file3
hs2   p32   OR4F5     0      50     -1
hs2   p32   KLHL17    50     100    -1
hs2   p32   PLEKHN1   100    150    -1
hs2   p32   n         150    300    -1
hs2   p31   LOC729737 300    500    -1
hs2   p30   HES4      500    600    -2
hs2   p30   ISG15     600    700    -2

Thank you

1 answer:

Answer 0: (score: 2)

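I have written this to help you, but you need to understand that your question is not acceptable on a site that offers free diagnostic help. You cannot simply state your requirements and wait for a high-quality free solution to pop up. I wrote an answer only because it is a bank holiday in my country and the problem interested me.

I have put noticeably more effort into creating this than you did into writing your question, and you have not even bothered to answer the several questions that people asked in the comments.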

use strict;
use warnings 'all';
use autodie;

use Readonly::Tiny 'Readonly';

Readonly my @FILES       => qw/ file1.txt file2.txt file3.txt /;
Readonly my $FORMAT      => "%-6s%-6s%-10s%-5d%-5d%d\n";
Readonly my @OUTPUT      => qw/ hs f1c f2c range1 range2 f3c /;
Readonly my @KEY_COLUMNS => qw/ hs range1 range2 /;

my %data;   # All the data for each value of `hs` 
my %bounds; # All the values of `range1` or `range2` for each value of `hs` 
my %heads;  # All the headers found in any of the files

# From each file, read the header line and use the
# headers as keys for the data hashes representing each line
#
for my $file ( @FILES ) {

    open my $fh, '<', $file; # Errors handled by `autodie`

    my @head = split ' ', <$fh>;
    @heads{@head} = ();

    while ( <$fh> ) {

        next unless /\S/;

        my %row;
        @row{@head} = split;

        my ($hs, $r1, $r2) = @row{ @KEY_COLUMNS };
        push @{ $data{$hs} }, \%row;

        ++$bounds{$hs}{$_} for $r1, $r2;
    }
}


# Change the `%bounds` hash values from
# hashes to sorted arrays of the boundary values
#
for ( values %bounds ) {

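    # Sort numerically, comparing on the digits only so that a stray
    # non-digit character in a boundary value does not affect the order
    # (the `r` flag of `tr` needs Perl 5.14 or later)
    my @vals = sort {
        my ($aa, $bb) = map { tr/0-9//cdr } $a, $b;
        $aa <=> $bb;
    } keys %$_;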

    $_ = \@vals;
}

# Work through the `%bounds` hash
# printing a line of output for each range
#
for my $hs ( sort keys %bounds ) {

    my $bounds = $bounds{$hs};
    my $data   = $data{$hs};

    for my $i ( 1 .. $#$bounds ) {

        my ($r1, $r2) = map { $bounds->[$_] } $i-1, $i;

        # Keep every row whose own range fully contains the interval $r1 .. $r2
        my @matches = grep {
            $r1 >= $_->{range1} and $r2 <= $_->{range2}
        } @$data;

        my %row;

        for my $match ( @matches ) {
            @row{ keys %$match } = values %$match;
        }

        @row{ @KEY_COLUMNS } = ($hs, $r1, $r2); # Overwrite in the new key values

        printf $FORMAT, map { $_ // 'n' } @row{ @OUTPUT };
    }
}

Output

hs1   p32   n         0    20   -1
hs1   p32   DDX11L1   20   50   -1
hs1   p32   FAM41C    50   70   -1
hs1   p32   WASH7P    70   120  -1
hs1   p32   n         120  180  -1
hs1   p32   FAM138A   180  200  -1
hs1   p31   FAM138A   200  250  -2
hs1   p31   n         250  300  -2
hs1   p30   n         300  350  -2
hs2   p32   OR4F5     0    50   -1
hs2   p32   KLHL17    50   100  -1
hs2   p32   PLEKHN1   100  150  -1
hs2   p32   n         150  300  -1
hs2   p31   LOC729737 300  500  -1
hs2   p30   HES4      500  600  -2
hs2   p30   ISG15     600  700  -2
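
The script rests on two steps: cut each `hs` group at every boundary value found in any of the files, then, for each resulting interval, pick the rows whose own range contains it. The following is only a minimal sketch of those two steps in isolation, not a replacement for the script. It uses the `hs1` rows of the second file and the `hs1` boundary values from the question, hard-coded rather than read from disk. The helper names `elementary_intervals` and `covering_row` are hypothetical and do not appear in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of boundary values into the list of
# [start, end] pairs between consecutive distinct boundaries, in order.
sub elementary_intervals {
    my @bounds = @_;
    my %seen;
    my @sorted = sort { $a <=> $b } grep { !$seen{$_}++ } @bounds;
    return map { [ $sorted[ $_ - 1 ], $sorted[$_] ] } 1 .. $#sorted;
}

# Hypothetical helper: return the first row whose range1 .. range2 fully
# contains the interval $r1 .. $r2, or undef if no row does.
sub covering_row {
    my ( $r1, $r2, @rows ) = @_;
    my ($hit) = grep { $r1 >= $_->{range1} && $r2 <= $_->{range2} } @rows;
    return $hit;
}

# The hs1 rows of the second file from the question
my @file2_hs1 = (
    { f2c => 'DDX11L1', range1 => 20,  range2 => 50 },
    { f2c => 'FAM41C',  range1 => 50,  range2 => 70 },
    { f2c => 'WASH7P',  range1 => 70,  range2 => 120 },
    { f2c => 'FAM138A', range1 => 180, range2 => 250 },
);

# Every range1 and range2 value that occurs for hs1 in the three files
my @bounds = ( 0, 200, 300, 350, 20, 50, 70, 120, 180, 250 );

my @lines;
for my $interval ( elementary_intervals(@bounds) ) {
    my ( $r1, $r2 ) = @$interval;
    my $row  = covering_row( $r1, $r2, @file2_hs1 );
    my $name = $row ? $row->{f2c} : 'n';    # 'n' marks "no value in file2"
    push @lines, sprintf( '%s %d %d', $name, $r1, $r2 );
}

print "$_\n" for @lines;
```

Because every boundary from every file becomes a cut point, a row such as FAM138A (180 to 250) is split at 200, which is why it appears on two output lines with different `f1c` and `f3c` values.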