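I have three files that contain mostly similar information, but each has one column unique to it. I want to combine them into a single file. What the files have in common are the column headed hs and the columns headed range1 and range2. The columns that differ are the ones labelled f1c, f2c and f3c. I want to combine the files based on overlap of the regions between range1 and range2 (the hs column must match as well).
The ranges run across two bars: bar1 (hs1) has 350 sections and bar2 (hs2) has 700 sections. The values under f1c, f2c and f3c each apply to a certain number of these sections on one of the bars. Where values apply to the same sections, I want them listed next to each other.
A solution in bash, awk or perl would be fine; I'm just not sure how to match these up by range.
Below are samples of the files.
First file format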
hs f1c range1 range2
hs1 p32 0 200
hs1 p31 200 300
hs1 p30 300 350
hs2 p32 0 300
hs2 p31 300 500
hs2 p30 500 700
Second file format
f2c hs range1 range2
DDX11L1 hs1 20 50
FAM41C hs1 50 70
WASH7P hs1 70 120
FAM138A hs1 180 250
OR4F5 hs2 0 50
KLHL17 hs2 50 100
PLEKHN1 hs2 100 150
LOC729737 hs2 300 500
HES4 hs2 500 600
ISG15 hs2 600 700
Third file format
hs range1 range2 f3c
hs1 0 200 -1
hs1 200 350 -2
hs2 0 500 -1
hs2 500 700 -2
Below is a sample of the desired output (if no value from file2 falls within a range, there is an n under f2c)
hs f1c f2c range1 range2 f3c
hs1 p32 n 0 20 -1 // From the 1st line of file3, and the 1st line of file1
hs1 p32 DDX11L1 20 50 -1 // From the 1st line of file1, 1st line of file2 and 1st line of file3
hs1 p32 FAM41C 50 70 -1 // From the 1st line of file1, 2nd line of file2 and 1st line of file3
hs1 p32 WASH7P 70 120 -1 // 1st line file1, 3rd line file2, first line file3
hs1 p32 n 120 180 -1 // 1st line file1, 1st line file3
hs1 p32 FAM138A 180 200 -1 // 1st line file1, 4th line file2, 1st line file3
hs1 p31 FAM138A 200 250 -2 // 2nd line file1, 4th line file2, 2nd line file3
hs1 p31 n 250 300 -2 // 2nd line file1, 2nd line file3
hs1 p30 n 300 350 -2 // 3rd line file1, 2nd line file3
hs2 p32 OR4F5 0 50 -1
hs2 p32 KLHL17 50 100 -1
hs2 p32 PLEKHN1 100 150 -1
hs2 p32 n 150 300 -1
hs2 p31 LOC729737 300 500 -1
hs2 p30 HES4 500 600 -2
hs2 p30 ISG15 600 700 -2
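Thank you
Answer 0 (score: 2)
I have written this to help you, but you need to understand that your question isn't acceptable on a site that offers free diagnostic help. You can't simply state your requirements and wait for a high-quality free solution to pop up. I wrote an answer only because it is a bank holiday in my country and the problem interested me.
I have put noticeably more effort into creating this than you did into writing your question, and you haven't even bothered to answer the several questions that people asked in the comments.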
use strict;
use warnings 'all';
use autodie;

use Readonly::Tiny 'Readonly';

Readonly my @FILES       => qw/ file1.txt file2.txt file3.txt /;
Readonly my $FORMAT      => "%-6s%-6s%-10s%-5d%-5d%d\n";
Readonly my @OUTPUT      => qw/ hs f1c f2c range1 range2 f3c /;
Readonly my @KEY_COLUMNS => qw/ hs range1 range2 /;

my %data;    # All the data for each value of `hs`
my %bounds;  # All the values of `range1` or `range2` for each value of `hs`
my %heads;   # All the headers found in any of the files

# From each file, read the header line and use the
# headers as keys for the data hashes representing each line
#
for my $file ( @FILES ) {

    open my $fh, '<', $file;    # Errors handled by `autodie`

    my @head = split ' ', <$fh>;
    @heads{@head} = ();

    while ( <$fh> ) {

        next unless /\S/;

        my %row;
        @row{@head} = split;

        my ($hs, $r1, $r2) = @row{ @KEY_COLUMNS };

        push @{ $data{$hs} }, \%row;
        ++$bounds{$hs}{$_} for $r1, $r2;
    }
}

# Change the `%bounds` hash values from
# hashes to sorted arrays of the boundary values
#
for ( values %bounds ) {

    my @vals = sort {
        my ($aa, $bb) = map { tr/0-9//cdr } $a, $b;
        $aa <=> $bb;
    } keys %$_;

    $_ = \@vals;
}

# Work through the `%bounds` hash
# printing a line of output for each range
#
for my $hs ( sort keys %bounds ) {

    my $bounds = $bounds{$hs};
    my $data   = $data{$hs};

    for my $i ( 1 .. $#$bounds ) {

        my ($r1, $r2) = map { $bounds->[$_] } $i-1, $i;

        my @matches = grep {
            $r1 >= $_->{range1} and $r2 <= $_->{range2}
        } @$data;

        my %row;
        for my $match ( @matches ) {
            @row{ keys %$match } = values %$match;
        }

        @row{ @KEY_COLUMNS } = ($hs, $r1, $r2);    # Overwrite in the new key values

        printf $FORMAT, map { $_ // 'n' } @row{ @OUTPUT };
    }
}
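The sort block in the script compares boundary values after deleting every non-digit character with `tr/0-9//cdr` (the `/r` flag needs Perl 5.14 or later). Here is a small sketch of that key on its own; `numeric_key` and the sample list are made up for illustration and are not part of the script.

```perl
use strict;
use warnings;

# numeric_key is a hypothetical helper that mirrors the key used in the sort block.
sub numeric_key {
    my ($value) = @_;
    return $value =~ tr/0-9//cdr;    # return a copy with all non-digits deleted
}

# Made-up boundary values, one of them carrying a stray trailing dot.
my @bounds = sort { numeric_key($a) <=> numeric_key($b) } qw/ 300 250. 20 200 /;

print "@bounds\n";    # prints: 20 200 250. 300
```

Because the minus sign and the decimal point are deleted as well, this key only suits non-negative integer boundaries such as the ones in the sample files.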
Output
hs1   p32   n         0    20   -1
hs1   p32   DDX11L1   20   50   -1
hs1   p32   FAM41C    50   70   -1
hs1   p32   WASH7P    70   120  -1
hs1   p32   n         120  180  -1
hs1   p32   FAM138A   180  200  -1
hs1   p31   FAM138A   200  250  -2
hs1   p31   n         250  300  -2
hs1   p30   n         300  350  -2
hs2   p32   OR4F5     0    50   -1
hs2   p32   KLHL17    50   100  -1
hs2   p32   PLEKHN1   100  150  -1
hs2   p32   n         150  300  -1
hs2   p31   LOC729737 300  500  -1
hs2   p30   HES4      500  600  -2
hs2   p30   ISG15     600  700  -2
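The main loop of the script relies on one observation: since every range1 and range2 value becomes a cut point, an interval between two adjacent cut points can never partly overlap a record. It is either wholly inside the record's range or outside it, so a plain containment test is enough. The sketch below shows that idea in isolation on a cut-down set of hs1 records held in memory; `split_intervals` and the record layout are assumptions made for illustration, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical in-memory records for hs1 only, taken from the sample files.
# Each record is [ range1, range2, column name, value ].
my @records = (
    [   0, 200, 'f1c', 'p32'     ],
    [ 200, 300, 'f1c', 'p31'     ],
    [ 300, 350, 'f1c', 'p30'     ],
    [  20,  50, 'f2c', 'DDX11L1' ],
    [ 180, 250, 'f2c', 'FAM138A' ],
    [   0, 200, 'f3c', -1        ],
    [ 200, 350, 'f3c', -2        ],
);

# split_intervals is an illustrative helper, not a function from the script.
sub split_intervals {
    my @recs = @_;

    # Every start or end value is a potential cut point.
    my %seen;
    $seen{$_}++ for map { @{$_}[0, 1] } @recs;
    my @cuts = sort { $a <=> $b } keys %seen;

    my @rows;
    for my $i ( 1 .. $#cuts ) {
        my ($lo, $hi) = @cuts[ $i - 1, $i ];
        my %row = ( range1 => $lo, range2 => $hi );

        # A record applies when it fully contains the elementary interval.
        for my $rec ( @recs ) {
            my ($r1, $r2, $col, $val) = @$rec;
            $row{$col} = $val if $lo >= $r1 and $hi <= $r2;
        }
        push @rows, \%row;
    }
    return @rows;
}

# Print one tab-separated line per elementary interval, 'n' where nothing applies.
for my $row ( split_intervals(@records) ) {
    print join( "\t", map { $row->{$_} // 'n' } qw/ f1c f2c range1 range2 f3c / ), "\n";
}
```

With these seven records the cut points are 0, 20, 50, 180, 200, 250, 300 and 350, which gives seven elementary intervals. The interval 180 to 200, for example, picks up p32, FAM138A and -1.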