Find the numbers in one file that fall within the number ranges of another file

Time: 2014-07-18 09:42:10

Tags: bash perl awk

I have two input files:

file1
1   982444
1   46658343
3   15498261
2   238295146
21  47423507
X   110961739
17  7490379
13  31850803
13  31850989

file2
1   982400  982480
1   46658345    46658350
2   14  109
2   5000    9000
2   238295000   238295560
X   110961739   120000000
17  7490200 8900005

Desired output:
1   982444
2   238295146
X   110961739
17  7490379

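Here is what I want: look up the column 1 value of each file1 line in column 1 of file2. Where they are equal, take the number in column 2 of file1 and check whether it falls within the range given by columns 2 and 3 of file2. If it does, print the file1 line in the output.

This may be a bit confusing to follow, but I'm doing my best. I have tried a few things but I'm far from a solution, and any help would be much appreciated. In bash, awk or perl, please.

Thanks in advance.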

5 answers:

Answer 0 (score: 3)

Just use awk. This solution does not loop over file1 repeatedly.

#!/usr/bin/awk -f
NR == FNR {
    # I'm processing file2 since NR still matches FNR
    # I'd store the ranges from it on a[] and b[]
    # x[] acts as a counter to the number of range pairs stored that's specific to $1
    i = ++x[$1]
    a[$1, i] = $2
    b[$1, i] = $3
    # Skip to next record; Do not allow the next block to process a record from file2.
    next
}
{
    # I'm processing file1 since NR is already greater than FNR
    # Let's get the index for the last range first then go down until we reach 0.
    # Nothing would happen as well if i evaluates to nothing i.e. $1 doesn't have a range for it.
    for (i = x[$1]; i; --i) {
        if ($2 >= a[$1, i] && $2 <= b[$1, i]) {
            # I find that $2 is within range. Now print it.
            print
            # We're done so let's skip to the next record.
            next
        }
    }
}

Usage:

awk -f script.awk file2 file1

Output:

1   982444
2   238295146
X   110961739
17  7490379
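
To reproduce this run without saving files by hand, here is a minimal sketch that writes the sample data from the question into a scratch directory and applies the same two-pass logic inline. The temporary paths and the array names `n`, `lo` and `hi` are mine, chosen for illustration; the data uses single spaces instead of the original column padding.

```shell
#!/bin/bash
# Scratch directory for the sample data (illustrative paths).
tmp=$(mktemp -d)

printf '%s\n' '1 982444' '1 46658343' '3 15498261' '2 238295146' \
    '21 47423507' 'X 110961739' '17 7490379' '13 31850803' '13 31850989' > "$tmp/file1"

printf '%s\n' '1 982400 982480' '1 46658345 46658350' '2 14 109' '2 5000 9000' \
    '2 238295000 238295560' 'X 110961739 120000000' '17 7490200 8900005' > "$tmp/file2"

# First pass (file2): store every range under its key.
# Second pass (file1): print a line once if any range for its key contains $2.
result=$(awk '
    NR == FNR { i = ++n[$1]; lo[$1, i] = $2; hi[$1, i] = $3; next }
    {
        for (i = n[$1]; i; --i)
            if ($2 >= lo[$1, i] && $2 <= hi[$1, i]) { print; next }
    }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

This should print the same four lines as the desired output, in file1 order.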

A similar approach using Bash (version 4.0 or later):

#!/bin/bash

FILE1=$1 FILE2=$2

declare -A A B X

while read F1 F2 F3; do
    (( I = ++X[$F1] ))
    A["$F1|$I"]=$F2
    B["$F1|$I"]=$F3
done < "$FILE2"

while read -r LINE; do
    read F1 F2 <<< "$LINE"
    for (( I = X[$F1]; I; --I )); do
        if (( F2 >= A["$F1|$I"] && F2 <= B["$F1|$I"] )); then
            echo "$LINE"
            # Move on to the next line of file1, like next in the awk version.
            continue 2
        fi
    done
done < "$FILE1"

Usage:

bash script.sh file1 file2
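
One bash detail matters when porting awk's `next` to nested shell loops: a bare `continue` only affects the innermost loop, while `continue 2` resumes the enclosing one. A minimal sketch with made-up loop values, counting how often the inner body runs in each case:

```shell
#!/bin/bash
# Counters for the two variants (names are illustrative).
plain=0 two=0

for outer in a b; do
    for inner in 1 2 3; do
        plain=$(( plain + 1 ))
        continue        # only skips to the next inner iteration
    done
done

for outer in a b; do
    for inner in 1 2 3; do
        two=$(( two + 1 ))
        continue 2      # jumps to the next outer iteration
    done
done

echo "continue:   $plain inner iterations"
echo "continue 2: $two inner iterations"
```

The first pair of loops runs the inner body 6 times, the second only 2 times. In a range lookup, that is the difference between printing a line once and printing it once per matching range.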

Answer 1 (score: 2)

Let's mix bash and awk:

while read col min max
do
    awk -v col="$col" -v min="$min" -v max="$max" '$1==col && min<=$2 && $2<=max' f1
done < f2

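Explanation

  • For each line of file2, read the minimum and maximum values together with the value of the first column.
  • Given these values, check file1 for lines that have the same first column and whose second column lies within the range given by file2.

Test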

$ while read col min max; do awk -v col="$col" -v min="$min" -v max="$max" '$1==col && min<=$2 && $2<=max' f1; done < f2
1   982444
2   238295146
X   110961739
17  7490379
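
Because this loop runs awk once per range, two consequences follow: the output comes out in file2 order, and a file1 line that falls inside two overlapping ranges is printed twice. The sample data has no overlaps, so neither shows up there. A minimal sketch with hypothetical data, wrapping the loop in a function I have named `match_ranges` and removing repeats with a standard awk idiom:

```shell
#!/bin/bash
# Hypothetical data: one point and two overlapping ranges on the same key.
tmp=$(mktemp -d)
printf '%s\n' '1 17' > "$tmp/f1"
printf '%s\n' '1 10 20' '1 15 30' > "$tmp/f2"

# The loop from this answer; $1 is the points file, $2 the ranges file.
match_ranges() {
    while read -r col min max; do
        awk -v col="$col" -v min="$min" -v max="$max" \
            '$1 == col && min <= $2 && $2 <= max' "$1"
    done < "$2"
}

raw=$(match_ranges "$tmp/f1" "$tmp/f2")
# Keep only the first occurrence of each line, preserving order.
deduped=$(match_ranges "$tmp/f1" "$tmp/f2" | awk '!seen[$0]++')

printf 'raw:\n%s\n' "$raw"
printf 'deduplicated:\n%s\n' "$deduped"
rm -rf "$tmp"
```

Here `raw` holds the line `1 17` twice and `deduped` holds it once.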

Answer 2 (score: 0)

Pure bash, based on Fedorqui's solution:

#!/bin/bash
while read col_2 min max
do
    while read col_1 val
    do
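       # Compare the labels as strings; inside (( )) a label such as X would be read as a variable name.
       [[ $col_1 == "$col_2" ]] && (( min <= val && val <= max )) && echo "$col_1 $val"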
    done < file1
done < file2
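
A caution about comparing the first column in bash: labels such as X are not numbers. In arithmetic context, bash treats a non-numeric word as a variable name and evaluates it as 0 when that variable is unset, so two different labels can compare as equal. A minimal sketch, using the made-up labels X and Y:

```shell
#!/bin/bash
# Make sure no variables named X or Y exist, so the result is deterministic.
unset X Y
a=X b=Y

# Arithmetic context: each label is looked up as a variable name,
# found unset, and evaluated as 0, so the comparison succeeds.
if (( a == b )); then arith="match"; else arith="no match"; fi

# String comparison: the labels are compared literally.
if [[ $a == "$b" ]]; then string="match"; else string="no match"; fi

echo "arithmetic: $arith"
echo "string:     $string"
```

The arithmetic test reports a match between X and Y, while the string test does not, which is why a string comparison is the safer choice for the key column.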

Answer 3 (score: 0)

cut -d' ' -f1 input2 | sed 's/^/^/;s/$/\\s/' | \
    grep -f - <(cat input2 input1) | sort -n -k1 -k3 | \
    awk 'NF==3 {
            split(a,b,",");
            for (v in b)
                if ($2 <= b[v] && $3 >= b[v])
                    print $1, b[v];
            if ($1 != p) a=""}
         NF==2 {p=$1;a=a","$2}'

Produces:

X 110961739
1 982444
2 238295146
17 7490379

Answer 4 (score: 0)

Here is a Perl solution. It would probably be faster, but less concise, if I built a hash from file2, but this should be fine.

use strict;
use warnings;
use autodie;

my @bounds = do {
  open my $fh, '<', 'file2';
  map [ split ], <$fh>;
};

open my $fh, '<', 'file1';
while (my $line = <$fh>) {
  my ($key, $val) = split ' ', $line;
  for my $bound (@bounds) {
    next unless $key eq $bound->[0] and $val >= $bound->[1] and $val <= $bound->[2];
    print $line;
    last;
  }
}

Output:

1   982444
2   238295146
X   110961739
17  7490379