Question

I have two files, file1.txt and file2.txt, as shown below -

file1.txt                  file2.txt 
col1(date)                 col1(Date)
col2(number: 4343250019 )  col2(last value of number: 9)
col3(number)               col3(number)
col5(alphanumeric)         col5(alphanumeric)

The requirement is to get the records of file2.txt that are not available in file1.txt, matching on the columns listed above (for col2, only the last digit of the file1.txt number is compared).

Expected output:

2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one

This output line is not available in file1.txt, but it is available in file2.txt after the matching criteria are applied.

I am trying the following steps to get this output -

###Replacing the space/tab from the file1.txt with pipe
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS="|" file1.txt > file1.txt1

### Looping on a combination of four column of file1.txt1 with combination of modified column of file2.txt and output in output.txt
awk 'BEGIN{FS=OFS="|"} {a[$1FS$2FS$3FS$5];next} {(($1 FS substr($2,length($2),1) FS $3 FS $5) in a) print $0}' file2.txt file1.txt1 > output.txt

###And finally, replace the "N" from column 8th and put "NULL" if the value is "N".
awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt >  output.txt1

What is the issue?

My second operation does not work, and I am trying to combine all three operations into a single one.

Answer 1

awk -F'[|]|[[:blank:]]+' 'FNR==NR{E[$1($2%10)$3$5]++;next}!($1$2$3$5 in E)' file1.txt file2.txt

Also, your sample output is wrong; it should be the following (the last key field is what differs: data45333):

2016-07-20-22|9|1003116|001|data45333|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2017-06-22-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one

Commented code

# one separator definition for both files: blanks for the first, `|` for the second
awk -F'[|]|[[:blank:]]+' '
   # for the first file
   FNR==NR{
      # create an index entry based on the 4 fields. The format of the fields
      # allows concatenating them directly without a separator (the key stays unique)
      E[ $1 ( $2 % 10 ) $3 $5 ]++
      # for this line (file) do not go further
      next
      }

   # for the lines of the next file

   # if the key is not in the index list, print the line (default action)
   ! ( ( $1 $2 $3 $5 ) in E ) { print }
   ' file1.txt file2.txt
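
The comment above relies on the field formats to keep concatenated keys unambiguous. As a minimal sketch of what happens when that assumption does not hold, the made-up two-field input below (not taken from the question) produces the same key for two different records when the fields are simply concatenated, while a comma-separated subscript (which awk joins with SUBSEP) keeps them apart:

```shell
# Hypothetical two-field input: "ab c" and "a bc" are different records.
input='ab c
a bc'

# Keys built by plain concatenation: both records collapse to the key "abc".
plain_keys=$(printf '%s\n' "$input" |
  awk '{ seen[$1 $2] } END { for (k in seen) n++; print n }')

# Keys built with a comma: awk joins the parts with SUBSEP, so they stay distinct.
subsep_keys=$(printf '%s\n' "$input" |
  awk '{ seen[$1, $2] } END { for (k in seen) n++; print n }')

echo "distinct keys, plain concatenation: $plain_keys"
echo "distinct keys, SUBSEP subscript:    $subsep_keys"
```

With the date, digit, number and alphanumeric fields used in the question, plain concatenation is probably safe, but a separator costs nothing if the field formats ever change.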

Answer 2

You could try the awk command below:

awk -F'[ |]*' 'NR==FNR{su=substr($2,length($2),1); a[$1":"su":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{print $0}' f1 f2

where:

a[] - an associative array.
$1":"su":"$3":"$5 - this forms the key for the array index, where su=substr($2,length($2),1) is the last digit of field $2. The value 1 is then assigned to this key.
NR==FNR{...;next} - this block handles the processing of f1.
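
To make the key construction concrete, here is a small sketch that prints the key this approach would build for each line of the first file. The two input lines are shortened copies of the f1 sample shown in Answer 3; shortening them to five fields is an assumption made only to keep the example compact.

```shell
# Two records in the layout of f1 (first five fields only, for brevity).
keys=$(printf '%s\n' \
  '2016-07-20-22   4343250019    1003116 001 data45343' \
  '2016-06-20-22       654650018    1003116 001 data45343' |
  awk '{
         su = substr($2, length($2), 1)   # last digit of the second field
         print $1 ":" su ":" $3 ":" $5    # same key layout as in the one-liner
       }')

printf '%s\n' "$keys"
```

Each printed key has the same shape as the key built from $1, $2, $3 and $5 of a line in the second file, which is what makes the lookup possible.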

Answer 3

Input

$ cat f1
2016-07-20-22   4343250019    1003116 001 data45343    25-JUL-16 11-MAR-16 1            N            0          0 N 
2016-06-20-22       654650018    1003116 001 data45343    25-JUL-17 11-MAR-16 1           N            0      0 N 

$ cat f2
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one

Output

$ awk 'FNR==NR{a[$1,substr($2,length($2)),$3,$5];next}!(($1,$2,$3,$5) in a)' f1 FS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one

Explanation

awk '                                          # call awk
 FNR==NR{                                      # true only while awk reads the first file
           a[$1,substr($2,length($2)),$3,$5]   # index array a by $1, the last char of $2, $3 and $5
           next                                # stop processing this record, go to the next line
        }
        !(($1,$2,$3,$5) in a)                  # for f2: print the record if index $1,$2,$3,$5 is not in array a
       ' f1 FS="|" f2                          # read f1, then
                                               # set FS and then read f2

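FNR==NR - true when the number of records read so far in the current file equals the number of records read so far across all files, a condition that holds only while the first file is being read.
a[$1,substr($2,length($2)),$3,$5] - populates array "a" with an index built from the first field, the last character of the second field, the third field and the fifth field of the current record of file1.
next - moves on to the next record, so that none of the processing meant for the records of the second file is done.
!(($1,$2,$3,$5) in a) - if the index built from the fields ($1,$2,$3,$5) of the current record of file2 does not exist in array a, we get boolean true (! is the logical NOT operator: it reverses the logical state of its operand, so a true condition becomes false and vice versa), and awk performs the default action, print $0, on the file2 record.
f1 FS="|" f2 - reads file1 (f1), sets the field separator to "|" once the first file has been read, and then reads file2 (f2).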
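
The two mechanisms described above, the FNR==NR test and the FS assignment placed between the file names, can be observed directly with a small trace. This is only a sketch: the file contents are made up, and the file names f1 and f2 are reused from the answer for readability.

```shell
# Create two tiny throwaway files in a temporary directory.
workdir=$(mktemp -d)
printf 'a\nb\n'     > "$workdir/f1"   # blank-separated, one field per line
printf 'x|1\ny|2\n' > "$workdir/f2"   # pipe-separated, two fields per line

# For every record, print the file name, NR, FNR, whether FNR==NR holds,
# and NF. FS="|" takes effect only after f1 has been read completely.
trace=$(cd "$workdir" &&
  awk '{ print FILENAME, NR, FNR, (FNR == NR ? "first" : "later"), NF }' f1 FS="|" f2)

printf '%s\n' "$trace"
rm -rf "$workdir"
```

The lines of f2 report two fields each, which shows that the separator changed between the files. One caveat of this idiom: if the first file is empty, FNR==NR is also true for the second file.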

--Edit--

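When the file size is around 60 GB (900 million rows), it is not a good idea to process the file twice. The third operation (replacing "N" with "NULL" in column 8, i.e. awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt) can be done in the same pass: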

$ awk 'FNR==NR{
         a[$1,substr($2,length($2)),$3,$5];
         next
      }
     !(($1,$2,$3,$5) in a){ 
         sub(/N/,"NULL",$8); 
         print
     }' f1 FS="|" OFS="|" f2

2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
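
The question asks to put "NULL" only when the value of column 8 is "N". sub(/N/, ...) replaces the first "N" found anywhere in the field, so a possible variation, sketched here on made-up records (the value "NN" is hypothetical and does not appear in the sample data), is to compare the whole field instead:

```shell
# Two hypothetical pipe-separated records; only the 8th field differs.
records='a|b|c|d|e|f|1|N|0
a|b|c|d|e|f|1|NN|0'

# Exact comparison: the field is replaced only when it is exactly "N".
exact=$(printf '%s\n' "$records" |
  awk 'BEGIN { FS = OFS = "|" } $8 == "N" { $8 = "NULL" } { print }')

# sub() with a regex: the first "N" is replaced even inside a longer value.
with_sub=$(printf '%s\n' "$records" |
  awk 'BEGIN { FS = OFS = "|" } { sub(/N/, "NULL", $8); print }')

printf 'exact match:\n%s\n\nsub():\n%s\n' "$exact" "$with_sub"
```

If column 8 only ever holds a single character, both forms give the same result, and the one-pass command above needs no change.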

How to find the differences between two files using multiple conditions?
