How to compare numeric fields in two files with awk

Asked: 2012-05-08 23:32:27

Tags: bash awk

I have these two files: file1

2537

1279

1075

12799

1474

135441

1260

1169

1281

10759

and file2

1070,1279960511,BR,USA,UNITED STATES
1278,1279960511,US,USA,UNITED STATES
1279,1279960527,CA,CAN,CANADA
1289,1279967231,US,USA,UNITED STATES
2679,1279971327,CA,CAN,CANADA
1279,1279971839,US,USA,UNITED STATES
1279,1279972095,CA,CAN,CANADA
1279,1279977471,US,USA,UNITED STATES
127997,1279977983,CA,CAN,CANADA
127997,1279980159,US,USA,UNITED STATES
127998,1279980543,CA,CAN,CANADA
107599,1075995007,US,USA,UNITED STATES
107599,1075995023,VG,VGB,VIRGIN ISLANDS, BRITISH
107599,1075996991,US,USA,UNITED STATES
107599,1075997071,CA,CAN,CANADA

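What I want: for each entry in file1, go through the first column of file2; when the value in that column becomes greater than the file1 entry, return the third field of file2. I have tried many approaches but none of them worked: I either get an empty file, or the output is different from what I expect. My last attempt was: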

awk -F, '
BEGIN {FS="," ; i=1 ; while (getline < "file2") { x[i] = $1 ; y[i] = $3 ; i++ }}

{ a[$1] = $1 ; h=1 ; while (x[h] <= a[$1]) { h++ } ; { print y[h] }}' file1

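But this runs forever: it never stops and never prints anything. Please help, this has been killing me for days and I am about to give up. Thanks.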

Desired output:

# This is a comment; I'll write file2 as if it were a matrix.

Because file1[1] > file2[1,1] ... and file1[1] > file2[2,1] ... and file1[1] > file2[3,1] ... and file1[1] > file2[4,1] but file1[1] < file2[5,1] ... then print file2[4,3] ... which is "US".

Now go to file1[2]:

file1[2] > file2[1,1] ... and file1[2] > file2[2,1] ... but file1[2] <= file2[3,1] ... then print file2[3,3]

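To summarize, I want to print: "the third element (column) of the first line (from file2) at which the file1 element first becomes > the first element of the next line (file2)".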

4 answers:

Answer 0 (score: 2)

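I used your AWK script as the basis for the following. I changed the variable names to make them more meaningful, which helps the script document itself.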

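#!/usr/bin/awk -f
# Expects file2 to be sorted numerically on its first column.
BEGIN {
    FS = ","
    count = 0
    # Load file2 into memory. The "> 0" test ends the loop at end of
    # file and also when file2 cannot be read (getline returns -1).
    while ((getline < "file2") > 0) {
        count++
        key[count] = $1
        countrycode[count] = $3
    }
    close("file2")
}

# For each value in file1, print the country code of the first
# file2 row whose key is greater than that value.
{
    for (idx = 1; idx <= count; idx++)
    {
        if ($1 < key[idx]) {
            print countrycode[idx]
            next
        }
    }
}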

Sample run (printing $0 rather than just $3; the code above prints only $3):

$ sort -n -k1,1 -t, file2 > tmp; mv tmp file2
$ ./scannums file1
2679,1279971327,CA,CAN,CANADA
1289,1279967231,US,USA,UNITED STATES
1278,1279960511,US,USA,UNITED STATES
127997,1279977983,CA,CAN,CANADA
2679,1279971327,CA,CAN,CANADA
1278,1279960511,US,USA,UNITED STATES
1278,1279960511,US,USA,UNITED STATES
1289,1279967231,US,USA,UNITED STATES
127997,1279977983,CA,CAN,CANADA

Note that nothing is printed for the value 135441 in file1, because no row in file2 satisfies the condition.

You can turn it into a one-liner if you prefer.
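For reference, a condensed variant might look like the sketch below. It is not the script above: it swaps the getline loop for the common NR == FNR two-file idiom, and the helper name lookup and the shortened sample data are assumptions made only for illustration. As before, file2 is assumed to be sorted numerically on its first column.

```shell
#!/bin/sh
# Hypothetical sample data, shorter than the files in the question.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' 2537 1279 1075 135441 > "$dir/file1"
printf '%s\n' \
  '1070,1279960511,BR,USA,UNITED STATES' \
  '1278,1279960511,US,USA,UNITED STATES' \
  '1289,1279967231,US,USA,UNITED STATES' \
  '2679,1279971327,CA,CAN,CANADA' > "$dir/file2"

# lookup is a hypothetical helper name.
# $1 = lookup table (file2, sorted on column 1), $2 = values (file1).
lookup() {
  awk -F, '
    # First file on the command line: load keys and country codes.
    NR == FNR { key[++n] = $1 + 0; code[n] = $3; next }
    # Second file: print the code of the first key above the value.
    {
      for (i = 1; i <= n; i++)
        if (($1 + 0) < key[i]) { print code[i]; next }
    }' "$1" "$2"
}

lookup "$dir/file2" "$dir/file1"
```

With this sample data it prints CA, US, US and nothing for 135441. One caveat of the idiom: if the first file is empty, NR == FNR is also true for the second file.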

Answer 1 (score: 2)

Would this work?

sort -n -t"," -k1,1 file1 file2 | awk -F"," '{if ($3 != "") {s = $3;} else {print $1 " " s;}}'

which produces

1075 BR
1169 BR
1260 BR
1279 US
1281 US
1474 US
2537 US
10759 CA
12799 CA
135441 CA

If the original order of file1 matters, the following can be used:

awk '{print NR "," $1}' file1 file2 | sort -t"," -n -k 2,2 | awk -F"," '{if ($4 != "") {s = $4;} else {print $1 " " s;}}' | sort -t"," -k1,1 | cut -d" " -f2

which produces

US
CA
BR
BR
US
CA
US
BR
CA
US
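One thing to watch in the order-preserving pipeline: its last sort has no -n, so it compares the line numbers as text, and line 10 would be placed before line 2. The sketch below shows the same tag, sort, restore idea with numeric sorts throughout. The helper name ordered_lookup and the sample data are assumptions for illustration, and the data was chosen so that no file1 value equals a file2 key, since the order of such ties after sorting is not pinned down here.

```shell
#!/bin/sh
# Hypothetical sample data with no ties between file1 values and file2 keys.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' 2537 1075 12799 1474 > "$dir/file1"
printf '%s\n' \
  '1070,1279960511,BR,USA,UNITED STATES' \
  '1278,1279960511,US,USA,UNITED STATES' \
  '2679,1279971327,CA,CAN,CANADA' > "$dir/file2"

# ordered_lookup is a hypothetical helper name. $1 = file1, $2 = file2.
ordered_lookup() {
  {
    awk 'NF { print $1 "," NR }' "$1"   # file1 rows become value,lineno
    cut -d, -f1,3 "$2"                  # file2 rows become key,code
  } |
  sort -t, -k1,1n |                     # merge both streams numerically
  awk -F, '
    $2 ~ /^[0-9]+$/ { print $2 "," last; next }   # file1 row: emit lineno,code
    { last = $2 }                                 # file2 row: remember the code
  ' |
  sort -t, -k1,1n |                     # restore file1 order numerically
  cut -d, -f2
}

ordered_lookup "$dir/file1" "$dir/file2"
```

Each value receives the code of the last file2 row whose key lies below it, which is the same "carry the last seen code" behaviour as the pipelines in this answer.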

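Answer 2 (score: 1)

Could you just use xargs for the "read file1" part of the job? The "search for a value in file2" part on its own is very simple in awk, and you avoid having to handle two file pointers at once...

Edit: an example using xargs and awk.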

cat file1 | xargs awk '$1 > ARGV[2] {print $3; return}' file2

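Edit: this example works (I tried it on my PC this time...)

Use -n 1 as an xargs option so that only one argument is passed on each invocation. The "val" argument is removed once it has been stored, so awk sees only the file name (file2) and knows what to do. A flag records when the match has been found, because return does not exist here.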

cat file1 | xargs -n 1 awk -F, 'BEGIN {val = ARGV[2]; ARGC--; found=0} $1 > val {if (found==0) { print val, $3; found = 1}}' file2

Edit: a shorter version

cat file1 | xargs -n 1 awk -F, 'BEGIN {val = ARGV[2]; ARGC--} (!found) && ($1 > val)  {print val, $3; found = 1}' file2
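The ARGV and ARGC handling described above can be tried in isolation with the sketch below. It is a variation, not the exact commands from this answer: it uses awk's standard exit statement instead of a found flag to stop at the first match, and the names prog and scan as well as the sample data are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sample table.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' \
  '1070,1279960511,BR,USA,UNITED STATES' \
  '1278,1279960511,US,USA,UNITED STATES' \
  '1289,1279967231,US,USA,UNITED STATES' \
  '2679,1279971327,CA,CAN,CANADA' > "$dir/file2"

# xargs appends the value after the file name, so inside awk
# ARGV[1] is the table and ARGV[2] is the value to look up.
# ARGC-- keeps awk from trying to open the value as a file.
prog='BEGIN { val = ARGV[2] + 0; ARGC-- }
      ($1 + 0) > val { print val, $3; exit }'

# scan is a hypothetical helper: values on stdin, table file as $1.
scan() {
  xargs -n 1 awk -F, "$prog" "$1"
}

printf '%s\n' 1279 2537 | scan "$dir/file2"
```

With this sample table the two lookups print "1279 US" and "2537 CA". Because awk is started once per value, this is simple but slower than a single awk pass on large inputs.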

Script version:

#!/usr/bin/awk -f
BEGIN {
  FS = ","        # file2 is comma-separated
  val = ARGV[2]   # the value to look up
  ARGC--          # keep awk from opening the value as a file
}
(!found) && ($1 <= val) {
  # cache 3rd column of previous line
  prev = $3
}
(!found) && ($1 > val) {
  # print cached value as soon as we cross the limit
  print val, prev
  found = 1
}

Name it find_val.awk and make it executable with chmod +x. You can then run ./find_val.awk somefile somevalue directly, and drive it from xargs in the same way:

cat file1 | xargs -n 1 ./find_val.awk file2

Answer 3 (score: 1)

Long one-liner:

Here is one way you could do it:

cat file1|grep -vE '^$'|while read min; do cat file2|while read line; do val=$(echo $line|cut -d, -f1); if [ $min -lt $val ]; then short_country=$(echo $line|cut -d, -f3); echo $min: $short_country "($val)"; break; fi; done; done

This produces the output

2537: CA (2679)
1279: US (1289)
1075: US (1278)
12799: CA (127997)
1474: CA (2679)
1260: US (1278)
1169: US (1278)
1281: US (1289)
10759: CA (127997)

Explanation

It is easier to understand if you break it out into a script instead of keeping it as a one-liner:

#!/bin/bash

cat file1 |                               # read file1
grep -E '^[0-9]+$' |                      # filter out lines in file1 that don't just contain a number
while read min; do                        # for each line in file1:
  cat file2 |                               # read file2
  grep -E '^([0-9]+,){2}[A-Z]{2},' |        # filter out lines in file2 that don't match the right format
  while read line; do                       # for each line in file2:
    val=$(echo $line|cut -d, -f1)             # pull out $val: the first comma-delimited value
    if [ $min -lt $val ]; then                # if it's greater than the $min value read from file1:
      short_country=$(echo $line|cut -d, -f3)   # get the $short_country from the third comma-delimited value in file2
      echo "$min: $short_country ($val)"        # print it to stdout. You can get rid of ($val) here if you're not interested in it.
      break                                     # Now that we've found a value in file2, stop this loop and go to the next line in file1
    fi
  done
done

Since you did not originally specify an output format, I had to guess. Hope this works for you.
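As a possible refinement of the script above, the shell can split the comma-separated fields itself through IFS, which avoids starting echo and cut for every line. This is only a sketch under assumptions: the helper name lookup_sh, the field variable names, and the sample data are made up for illustration, and the grep format filters from the script are left out for brevity.

```shell
#!/bin/sh
# Hypothetical sample data; file1 contains a blank line on purpose.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' 2537 '' 1279 135441 > "$dir/file1"
printf '%s\n' \
  '1070,1279960511,BR,USA,UNITED STATES' \
  '1278,1279960511,US,USA,UNITED STATES' \
  '1289,1279967231,US,USA,UNITED STATES' \
  '2679,1279971327,CA,CAN,CANADA' > "$dir/file2"

# lookup_sh is a hypothetical helper name. $1 = file1, $2 = file2.
lookup_sh() {
  while read -r min; do
    [ -n "$min" ] || continue                 # skip blank lines in file1
    # IFS=, makes read split each file2 line on commas.
    while IFS=, read -r val ts code rest; do
      if [ "$min" -lt "$val" ]; then          # first key above the value
        echo "$min: $code ($val)"
        break
      fi
    done < "$2"
  done < "$1"
}

lookup_sh "$dir/file1" "$dir/file2"
```

With this sample data it prints "2537: CA (2679)" and "1279: US (1289)", and nothing for 135441. The -lt test assumes every non-blank line of file1 and every first field of file2 is an integer.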