Question

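I am trying to write a script that compares two large files based on column 2. Each file contains about 1 million records. For the output, I need to know which records share a column 2 value (present in both files) but have a different value in column 1. The files are quoted comma-separated value files.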

File1_pair

20151026,1111
20141113,2222
20130102,3333
77777777,9999

File2_pair

20151026,1111
20203344,2222
50506677,3333
77777777,8888

Desired_output
20141113,2222,20203344
20130102,3333,50506677

I tried modifying the script below, but could not get it to work correctly.

awk 'FNR==NR { a[$0]; next } !($2) in a { c++ } END { print c }' file1_pair file2_pair

Answer 1

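You have the right idea; you are just operating on the wrong fields.

You need to save every $2 value from the first file in an array, then check each $2 value from the second file against that array. You also need to compare the $1 values of the matching lines.

This awk script does that.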

awk -F , -v OFS=, '
    NR==FNR {
        # Store the value of $1 under the $2 key in a
        a[$2]=$1
        next
    }
    # If $2 is in a (we've seen this value before) and
    # if the value in the array (first file's $1 value) doesn't match this file's $1 value
    ($2 in a) && (a[$2] != $1) {
        # Print the original $1 value (from the array),$2,$1
        print a[$2],$2,$1
    }' file1_pair file2_pair
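
As a quick check, here is a minimal sketch that recreates the two sample files from the question and runs the same logic against them. The temporary directory and the `tmpdir` and `out` variable names are my own assumptions for illustration, not part of the original question.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
# (tmpdir and out are hypothetical names used only for this demonstration.)
tmpdir=$(mktemp -d)

# Sample data copied from the question.
cat > "$tmpdir/file1_pair" <<'EOF'
20151026,1111
20141113,2222
20130102,3333
77777777,9999
EOF

cat > "$tmpdir/file2_pair" <<'EOF'
20151026,1111
20203344,2222
50506677,3333
77777777,8888
EOF

# Same logic as the script above, written compactly:
# first file: remember $1 under the $2 key;
# second file: print when $2 was seen before and $1 differs.
out=$(awk -F , -v OFS=, '
    NR==FNR { a[$2]=$1; next }
    ($2 in a) && (a[$2] != $1) { print a[$2], $2, $1 }
' "$tmpdir/file1_pair" "$tmpdir/file2_pair")

printf '%s\n' "$out"
```

With the sample data this prints the two lines listed under Desired_output. Note that a[$2] holds a single value, so if a column 2 value appears more than once in the first file, only the last column 1 value seen for it is kept.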
