Question

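I have two files: one contains a list of single entries (fileA), and the other contains a list of ranges (fileB).

I want to find out which entries in fileA fall within any of the ranges in fileB.

Sample entries from both files:

fileA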

00100500000000
00100600000000
00100700000000
00100800000000
00100900000000
00101000000000 
00101300000000
00101500000000
00101600000000
00101700000000
00101710000000
00101800000000
35014080000000
35014088000000
35067373000000

fileB

00100200000000,00100200999999
00100300000000,00100300999999
00100100000000,00100100999999
00100400000000,00100400999999
00100500000000,00100500999999
00100600000000,00100600999999
00100700000000,00100700999999
00100800000000,00100800999999
00100900000000,00100900999999
00101000000000,00101000999999
00101300000000,00101300999999
00101500000000,00101500999999
00101600000000,00101600999999
35048702000000,35048702999999
35048802000000,35048802999999
35077160000000,35077160999999
35077820000000,35077820999999
35085600000000,35085600999999

I used the script below, but it takes about 6 days to process 140k entries in fileA against 50k ranges in fileB. Is there a way to make it faster?

list=`cat fileB`
for mobno in $list
do
  LowVal="$(echo $mobno | cut -d, -f1)"
  HighVal="$(echo $mobno | cut -d, -f2)"

 while read ThisLine; 
do [ ${ThisLine} -ge ${LowVal} ] && [ ${ThisLine} -le ${HighVal} ] && echo "${ThisLine}";done < fileA; 
done;

Answer 1

You will have to test its performance, but the following awk script is one option:

NR == 1 && FNR == 1 { strt=1
        }
FNR == 1 && NR != 1 {
        strt=0
        }
strt==0 {
        pos=$0
        for (i in ranges) {
                split(i,arry,",")
                if ( pos >= arry[1] && pos <= arry[2]) {
                        print i" - "$0
                        }
                }
        }
strt==1 {ranges[$0]=""
        }

Run it with:

 awk -f awkfile fileB fileA

Output:

00100500000000,00100500999999 - 00100500000000
00100600000000,00100600999999 - 00100600000000
00100700000000,00100700999999 - 00100700000000
00100800000000,00100800999999 - 00100800000000
00100900000000,00100900999999 - 00100900000000
00101000000000,00101000999999 - 00101000000000
00101300000000,00101300999999 - 00101300000000
00101500000000,00101500999999 - 00101500000000
00101600000000,00101600999999 - 00101600000000
00101700000000,00101700999999 - 00101700000000
00101710000000,00101710999999 - 00101710000000
00101800000000,00101800999999 - 00101800000000

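We are essentially reading both files and using the variable strt to tell where one file ends and the other begins. The ranges are read into an array (ranges), and each value in fileA is then compared against every range. awk treats these all-digit fields as numbers, so the leading zeros do not affect the comparison.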

Answer 2

Two approaches:

- Using grep:

grep -of fileA fileB

- Using the comm + sort + sed commands:

comm -12 <(sort fileA) <(sed 's/,/\n/' fileB | sort)

Output:

00100500000000
00100600000000
00100700000000
00100800000000
00100900000000
00101300000000
00101500000000
00101600000000
00101700000000
00101710000000
00101800000000

Answer 3

If the ranges in fileB are in ascending order, as in your example, you only need to take the first and last values as LowVal and HighVal. Try this:

LowVal=$(head -n1 fileB | cut -d, -f1)
HighVal=$(tail -n1 fileB | cut -d, -f2)

awk -vHighVal=$HighVal -vLowVal=$LowVal '$0 >= LowVal && $0 <= HighVal' fileA

Answer 4

cut seems to be slow, which is why it takes so much time. Try this code:

list=`cat fileB`
for mobno in $list
do
  IFS=', ' read -r -a array <<< $mobno
  LowVal=${array[0]}
  HighVal=${array[1]}

 while read ThisLine; 
do [ ${ThisLine} -ge ${LowVal} ] && [ ${ThisLine} -le ${HighVal} ] && echo "${ThisLine}";done < fileA; 
done;

Answer 5

Here is my take on this. awk is the tool to use. Here it is as a one-liner:

$ awk -F, 'NR==FNR{range[$1]=$2;next}{for(low in range){if($1>=low&&$1<=range[low]){print $1}}}' fileB fileA

Broken out for easier commenting:

$ awk '

    BEGIN {
      FS=","         # Field separator, "-F," in the one-liner
    }

    NR==FNR {        # Run this bit on just the first file specified, your ranges
      range[$1]=$2   # Store the range in an array
      next
    }

    {                           # For each value in your data file,
      for (low in range) {      # step through the ranges
        if ($1 >= low && $1 <= range[low]) {  # and test.
          print $1              # If they pass, print the value.
        }
      }
    }

  ' fileB fileA

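Note that this loads the entire set of ranges into memory as an array, so it may become a problem if fileB is many millions of lines long. Try it and see.

Note also that this solution does not depend on the files being sorted or in any particular order, but it does assume that no two ranges share the same low value. That is, you won't have 5...8 and 5...10 together. Your sample data doesn't contain any such ranges, but it is only a sample.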
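If ranges sharing a low value do turn up, one possible workaround is to keep the lows and highs in two parallel arrays indexed by line number, so that a later range cannot overwrite an earlier one. The following is only a sketch of that idea, not part of the original answer: the tiny data set and the array names low and high are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: keep every range, even when two ranges share the same low value.
# The sample data below is hypothetical and only serves the illustration.
tmp=$(mktemp -d)
printf '%s\n' '005,008' '005,010' '020,030' > "$tmp/fileB"
printf '%s\n' '004' '009' '025' '031'       > "$tmp/fileA"

result=$(awk -F, '
  NR == FNR {            # first file: the ranges
    n++
    low[n]  = $1         # parallel arrays indexed by line number,
    high[n] = $2         # so duplicate lows do not overwrite each other
    next
  }
  {                      # second file: the values to test
    for (i = 1; i <= n; i++) {
      if ($1 >= low[i] && $1 <= high[i]) {
        print $1
        break            # print each value once, at the first matching range
      }
    }
  }
' "$tmp/fileB" "$tmp/fileA")

echo "$result"
rm -rf "$tmp"
```

Here 009 is outside 005,008 but inside 005,010, so it is still reported. With range[$1]=$2, which of those two ranges survived would decide the result.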

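I'd be interested to hear how this solution compares with your 6-day version. :-)

Update #1

Here is the same logic in bash, just for the fun of it. Again, I'd love to see a speed comparison on your data set!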

$ declare -A range=()
$ while IFS=, read -r a b; do range["$a"]="$b"; done < fileB
$ while read -r val; do for low in "${!range[@]}"; do [[ 10#$val -ge 10#$low && 10#$val -le 10#${range[$low]} ]] && echo "$val"; done; done < fileA

Or, broken out script-style (with comments):

declare -A range=()

while IFS=, read -r a b; do
  range["$a"]="$b"                      # Store the ranges in an associative array
done < fileB                            # (requires bash 4+)

while read -r val; do                   # Read values...
  for low in "${!range[@]}"; do         # Step through our range, and
    [[ 10#$val -ge 10#$low && 10#$val -le 10#${range[$low]} ]] &&
    echo "$val"                         # test and print.
  done
done < fileA

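One obscure thing here is the 10# prefix on the values in the tests. It is there because, without it, bash would interpret integers with leading zeros as octal numbers, which would fail on this data set since it contains the digits 8 and 9. :-)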
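The effect of the 10# prefix can be seen in isolation with a small experiment. This is just an illustrative sketch; the variable names decimal and octal_ok are made up for the demonstration.

```shell
#!/usr/bin/env bash
val=00100800000000          # one of the sample values, leading zeros included

# With the 10# prefix the value is read as base 10 and the zeros are ignored.
decimal=$((10#$val))
echo "$decimal"

# Without the prefix, the leading zero marks an octal constant, and the digit 8
# is not valid in octal. The attempt runs in a subshell with stderr discarded
# so that the arithmetic error cannot stop this script.
if ( : $((val)) ) 2>/dev/null; then
  octal_ok=yes
else
  octal_ok=no
fi
echo "$octal_ok"
```

The first echo prints the value without its leading zeros, and the second prints no because the unprefixed evaluation fails.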

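Update #2

Purely for experimental purposes, here is a variant that might work in bash version 3.

This still uses an array, but a traditional (indexed) array rather than an associative one. The index is therefore numeric, so the numeric comparison against $low no longer needs the base prefix (10#).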

declare -a range=()

while IFS=, read -r a b; do
  range[10#$a]="$b"                     # Store the ranges in an indexed array,
done < fileB                            # keyed by the decimal value of the low bound

while read -r val; do                   # Read values...
  for low in "${!range[@]}"; do         # Step through our range, and
    [[ 10#$val -ge $low && 10#$val -le 10#${range[$low]} ]] &&
    echo "$val"                         # test and print.
  done
done < fileA
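To see why $low needs no 10# prefix with an indexed array, it may help to look at what the keys become once they are stored. This is a small sketch using two of the sample ranges; the variable keys is made up for the demonstration, and it assumes a bash build whose array indices are wide enough for 14-digit numbers (the usual case on 64-bit systems).

```shell
#!/usr/bin/env bash
declare -a range=()

# The subscript is evaluated arithmetically, so the stored key is a plain
# decimal integer with no leading zeros.
range[10#00100600000000]=00100600999999
range[10#00100500000000]=00100500999999

# Indexed arrays list their keys in ascending numeric order.
keys="${!range[@]}"
echo "$keys"
```

The stored values keep their leading zeros, which is why the upper bound still needs the 10# prefix when it is compared.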
