How to use awk

Time: 2019-05-28 22:01:34

Tags: bash shell unix awk command-line

Summary:

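I currently have two .txt files imported from a survey system that I am testing. Column 1 of each data file is a timestamp in the format "HHMMSS.SSSSSS". In file1, there is a second column of field-intensity readings. In file2, there are two additional columns of position information. I am trying to write a script that matches data points between these files by aligning the timestamps. The problem is that no timestamp has exactly the same value in both files. The script must be able to match data points (lines in each .txt file) based on the closest timestamp in the other file (i.e., time 125051.354948 from file1 should "match" the closest timestamp in file2, which is 125051.112784).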

If anyone with more knowledge of awk / sed / join / regex / Unix could point me in the right direction, I would greatly appreciate it.

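What I have so far:

(Please note that the exact syntax shown here may not make sense for the sample .txt files attached to this question; more extensive versions of these files, with more columns, were used to test the scripts.)

I am new to awk / Unix / shell scripting, so please bear with me if some of these attempted solutions do not work or do not make sense.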

I have tried some of the solutions posted here on Stack Overflow that use join, but it does not seem to sort or join either of these files properly:

    $ {
      join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 file1) <(sort -k 1 file2)
      join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 file1) <(sort -k 1 file2)
    } | sort -k 1
  • Result: only outputs a similar version of the original file2

I have also tried to adapt existing awk solutions posted here:

    awk 'BEGIN {FS=OFS="\t"} NR==FNR {v[$3]=$2; next} {print $1, (v[$3] ? v[$3] : 0)}' file1 file2 > file3

    awk 'BEGIN {FS=OFS="\t"} NR==FNR {v[$1]=$2; next} {print $1, (v[$1] ? v[$1] : 0)}' file1 file2 > file3
  • Result: both of these awk commands produce an output containing file2's data with nothing from file1 included (or so it seems).

    awk -F '
    FNR == NR {
        time[$3]
        next
    }
    {   for(i in time)
            if(index($3, i) == 1) {
                print
                next
    
            }
    }' file1 file2 > file3
    
  • Result: keeps returning a syntax error regarding the "." of ".txt"

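I have considered integrating some sort of regex or split command into the script, but I am unsure how to proceed and have not come up with anything of substance.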

Sample data

    $ cat file1.txt

    125051.354948 058712.429

    125052.352475 058959.934

    125054.354322 058842.619

    125055.352671 058772.045

    125057.351794 058707.281

    125058.352678 058758.959


    $ cat file2.txt

    125050.105886 4413.34358 07629.87620

    125051.112784 4413.34369 07629.87606

    125052.100811 4413.34371 07629.87605

    125053.097826 4413.34373 07629.87603

    125054.107361 4413.34373 07629.87605

    125055.107038 4413.34375 07629.87604

    125056.093783 4413.34377 07629.87602

    125057.097928 4413.34378 07629.87603

    125058.098475 4413.34378 07629.87606

    125059.095787 4413.34376 07629.87602

Expected result:

(Format: Column1File1 Column1File2 Column2File1 Column2File2 Column3File2)

    $ cat file3.txt

    125051.354948 125051.112784 058712.429 4413.34358 07629.87620

    125052.352475 125052.100811 058959.934 4413.34371 07629.87605

    125054.354322 125054.107361 058842.619 4413.34373 07629.87605

    125055.352671 125055.107038 058772.045 4413.34375 07629.87604

    125057.351794 125057.097928 058707.281 4413.34378 07629.87603

    125058.352678 125058.098475 058758.959 4413.34378 07629.87606

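As shown, not every data point in each file will find a match. Only the pairs of lines whose timestamps are closest to each other will be written to the new file.

As mentioned, the current solutions result in a file3 that is either completely blank or contains information from only one of the two files (but not both).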

2 answers:

Answer 0: (score: 0)

Please try the following:

    awk '
        # Find the element of sorted array "a" closest to val; return its index.
        # low, high and mid are local variables.
        function binsearch(a, val, len,
            low, high, mid) {
            if (val < a[1])
                return 1
            if (val > a[len])
                return len

            low = 1
            high = len
            while (low <= high) {
                mid = int((low + high) / 2)
                if (val < a[mid])
                    high = mid - 1
                else if (val > a[mid])
                    low = mid + 1
                else
                    return mid
            }
            # Here a[high] < val < a[low]: return the nearer neighbor.
            return ((val - a[high]) < (a[low] - val)) ? high : low
        }
        # First file (file2.txt, sorted by time): store the timestamp
        # and the two position columns.
        NR == FNR {
            time[FNR] = $1
            pos1[FNR] = $2
            pos2[FNR] = $3
            len++
            next
        }
        # Second file (file1.txt): look up the nearest file2 timestamp.
        {
            i = binsearch(time, $1, len)
            print $1 " " time[i] " " $2 " " pos1[i] " " pos2[i]
        }
    ' file2.txt file1.txt

Result:

    125051.354948 125051.112784 058712.429 4413.34369 07629.87606
    125052.352475 125052.100811 058959.934 4413.34371 07629.87605
    125054.354322 125054.107361 058842.619 4413.34373 07629.87605
    125055.352671 125055.107038 058772.045 4413.34375 07629.87604
    125057.351794 125057.097928 058707.281 4413.34378 07629.87603
    125058.352678 125058.098475 058758.959 4413.34378 07629.87606

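Note that the 4th and 5th values in the first line of your expected result may have been copy-pasted incorrectly: they belong to the file2 line with timestamp 125050.105886, not to the matched timestamp 125051.112784.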
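The first match can be cross-checked independently of the binary search with a plain linear scan. The following is only a sketch under the assumption that the position file has the timestamp in column 1; nearest_line is a hypothetical helper name, not part of the answer's script.

```shell
#!/bin/bash
# nearest_line: hypothetical cross-check helper (linear scan, no binary search).
# Prints the line of FILE whose first column is numerically closest to VALUE.
# Usage: nearest_line VALUE FILE
nearest_line() {
    LC_ALL=C awk -v val="$1" '
        NF {                                  # skip blank lines
            d = $1 - val
            if (d < 0) d = -d                 # absolute difference
            if (!found || d < best) { best = d; line = $0; found = 1 }
        }
        END { if (found) print line }
    ' "$2"
}
```

With the sample file2.txt from the question, nearest_line 125051.354948 file2.txt prints 125051.112784 4413.34369 07629.87606, which agrees with the first line of the result above rather than with the expected result in the question.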

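[How it works]

The key is the binsearch function, which finds the element of the array closest to the given value and returns its index. I won't go into the details of the algorithm, as it is the common "binary search" technique.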
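To see the search in isolation, here is a small standalone sketch of the same idea. It assumes the candidate values are passed already sorted in ascending order and that at least one is given; nearest_index is a hypothetical wrapper name introduced only for this illustration.

```shell
#!/bin/bash
# nearest_index: hypothetical wrapper around the binary search described above.
# Prints the 1-based position of the sorted value closest to VALUE.
# Usage: nearest_index VALUE SORTED_VALUE...
nearest_index() {
    local val=$1
    shift
    printf '%s\n' "$@" | LC_ALL=C awk -v val="$val" '
        { a[NR] = $1 + 0 }                    # force numeric storage
        END {
            val += 0
            if (val <= a[1])  { print 1;  exit }
            if (val >= a[NR]) { print NR; exit }
            low = 1; high = NR
            while (low <= high) {
                mid = int((low + high) / 2)
                if      (val < a[mid]) high = mid - 1
                else if (val > a[mid]) low  = mid + 1
                else { print mid; exit }
            }
            # Loop ended with a[high] < val < a[low]; pick the nearer side.
            print ((val - a[high]) < (a[low] - val) ? high : low)
        }'
}

# Example with the first three file2 timestamps from the question:
#   nearest_index 125051.354948 125050.105886 125051.112784 125052.100811
# prints 2, because 125051.112784 is 0.242164 away and 125052.100811 is 0.745863 away.
```

Each step halves the remaining range, so a lookup costs on the order of log2(len) comparisons instead of a scan over every line of file2.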

Answer 1: (score: 0)

    #!/bin/bash

    if [[ $# -lt 2 ]]; then
      echo "wrong args, it should be $0 file1 file2" >&2
      exit 1
    fi

    # drop blank lines, add an extra column "m" to file1, merge file1 and file2, sort
    { awk 'NF{print $0, "m"}' "$1" ; awk 'NF' "$2"; } | sort -nk1,1 |
      awk '# record lines and fields into a
           {a[NR] = $0; a[NR,1] = $1; a[NR,2] = $2; a[NR,3] = $3}
           END{
             for(i=1; i<= NR; ++i){

               # the 3rd field of a file1 line is "m"
               if(a[i, 3] == "m"){

                 # distance in column 1 to the previous and the next record;
                 # -1 means that neighbor is missing or is another file1 line
                 prevDiff = (!((i-1) in a) || a[i-1,3] == "m") ? -1 : a[i,1] - a[i-1,1]
                 nextDiff = (!((i+1) in a) || a[i+1,3] == "m") ? -1 : a[i+1,1] - a[i,1]

                 # compare the distances, choose the closer neighbor and print
                 if(prevDiff != -1 && (nextDiff == -1 || prevDiff <= nextDiff))
                   print a[i,1], a[i-1, 1], a[i, 2], a[i-1, 2], a[i-1, 3]
                 else if(nextDiff != -1)
                   print a[i,1], a[i+1, 1], a[i, 2], a[i+1, 2], a[i+1, 3]
                 else
                   print a[i]
               }
             }
           }'

The output of { awk 'NF{print $0, "m"}' "$1" ; awk 'NF' "$2"; } | sort -nk1,1 is:

    125050.105886 4413.34358 07629.87620
    125051.112784 4413.34369 07629.87606
    125051.354948 058712.429 m
    125052.100811 4413.34371 07629.87605
    125052.352475 058959.934 m
    125053.097826 4413.34373 07629.87603
    125054.107361 4413.34373 07629.87605
    125054.354322 058842.619 m
    125055.107038 4413.34375 07629.87604
    125055.352671 058772.045 m
    125056.093783 4413.34377 07629.87602
    125057.097928 4413.34378 07629.87603
    125057.351794 058707.281 m
    125058.098475 4413.34378 07629.87606
    125058.352678 058758.959 m
    125059.095787 4413.34376 07629.87602
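One caveat about the merged stream above, and about any approach that subtracts these timestamps directly: the values are clock readings in HHMMSS.SSSSSS form, not counts of seconds. Sorting them numerically still gives the right order within a single day, but a difference taken across a minute or hour boundary is inflated (125959.9 and 130000.1 are 0.2 s apart, yet differ by 40.2 as plain numbers). The sample data never crosses such a boundary, so the results shown are unaffected. A minimal sketch of a conversion, assuming the format described in the question; hhmmss_to_seconds is a hypothetical helper name, not part of the script above:

```shell
#!/bin/bash
# hhmmss_to_seconds: hypothetical helper, assumes input of the form HHMMSS.SSSSSS.
# Prints the number of seconds since midnight with six decimal places.
# Usage: hhmmss_to_seconds TIMESTAMP
hhmmss_to_seconds() {
    LC_ALL=C awk -v t="$1" 'BEGIN {
        t += 0                                # treat the timestamp as a number
        h = int(t / 10000)                    # leading HH
        m = int((t - h * 10000) / 100)        # middle MM
        s = t - h * 10000 - m * 100           # trailing SS.SSSSSS
        printf "%.6f\n", h * 3600 + m * 60 + s
    }'
}
```

For example, hhmmss_to_seconds 125959.900000 prints 46799.900000 and hhmmss_to_seconds 130000.100000 prints 46800.100000, so the converted values differ by the true 0.2 s. Applying the same arithmetic to column 1 before computing prevDiff and nextDiff would be one way to make the comparison safe across boundaries.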