Compare two CSV files in Linux

Time: 2015-02-11 16:46:53

Tags: linux csv awk

I have two CSV files in the following format:

File1:

No.1, No.2
983264,72342349
763498,81243970
736493,83740940

File2:

No.1,No.2
"7938493","7364987"
"2153187","7387910"
"736493","83740940"

I need to compare the two files and output both the matching and the non-matching values. I tried to do it with awk:

#!/bin/bash

awk 'BEGIN {
    FS = OFS = ","
}
if (FNR==1){next}
NR>1 && NR==FNR {
    a[$1];
    next
}
FNR>1 {
    print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
    delete a[$1]
}
END {
    for (x in a) {
        print x FS "In file1 but not in file2"
    }
}' file1 file2

But the output is:

"7938493",In file2 but not in file1
"2153187",In file2 but not in file1
"8172470",In file2 but not in file1
7938493,In file1 but not in file2
2153187,In file1 but not in file2
8172470,In file1 but not in file2

Can you tell me where I went wrong?

1 answer:

Answer 0: (score: 2)

Here are some corrections to your script:

BEGIN {
    # FS = OFS = ","
    FS = "[,\"]+"
    OFS = ", "
}
# if (FNR==1){next}
FNR == 1 {next}

# NR>1 && NR==FNR {
NR==FNR {
    a[$1];
    next
}
# FNR>1 {
$2 in a {
    # print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
    print ($2 in a) ? $2 OFS "Match" : $2 OFS "In file2 but not in file1"
    delete a[$2]
}
END {
    for (x in a) {
        print x, "In file1 but not in file2"
    }
}

This is an awk script, so you can run it with awk -f script.awk file1 file2. Doing so gives the following result:

$ awk -f script.awk file1 file2
736493, Match
763498, In file1 but not in file2
983264, In file1 but not in file2
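
Note that the block guarded by $2 in a only runs for matching rows, so the output above never contains an "In file2 but not in file1" line. If those rows are wanted as well, one possible variant (a sketch, not part of the original answer) drops the pattern and uses an explicit if/else. The sample files are recreated from the question in a temporary directory; the tmpdir and result variable names are only illustrative.

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clobber anything.
tmpdir=$(mktemp -d)

# Sample data copied from the question.
cat > "$tmpdir/file1" <<'EOF'
No.1, No.2
983264,72342349
763498,81243970
736493,83740940
EOF

cat > "$tmpdir/file2" <<'EOF'
No.1,No.2
"7938493","7364987"
"2153187","7387910"
"736493","83740940"
EOF

result=$(awk '
BEGIN {
    FS = "[,\"]+"      # commas and double quotes both act as separators
    OFS = ", "
}
FNR == 1 { next }      # skip the header line of each file
NR == FNR {            # first file: remember every value in column 1
    a[$1]
    next
}
{                      # second file: $1 is empty, the first value is $2
    if ($2 in a) {
        print $2, "Match"
        delete a[$2]
    } else {
        print $2, "In file2 but not in file1"
    }
}
END {
    for (x in a)
        print x, "In file1 but not in file2"
}' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With the sample data this prints the two file2-only values, then the match, then the two file1-only values. The order of the last two depends on the awk implementation, because for (x in a) iterates in an unspecified order.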

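The main problem with your script was that it did not handle the double quotes around the numbers in file2 correctly. I changed the input field separator so that double quotes are treated as part of the separator. As a result, the first field $1 in the second file is empty (it is the bit between the start of the line and the first "), so you need to use $2 to refer to the first value you are interested in. I also replaced the bare if (FNR==1){next}, which is not valid outside an action block, with the pattern FNR == 1 {next}. Apart from that, I removed some redundant conditions from your other blocks and used OFS rather than FS in your first print statement.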
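
To see the empty first field described above, the following sketch prints every field that awk produces with the same FS, once for a quoted line and once for an unquoted line. The split_demo function name is made up for this illustration.

```shell
#!/bin/sh
split_demo() {
    # Print each field of the input line with its index, using the answer's FS.
    awk 'BEGIN { FS = "[,\"]+" }
         { for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

quoted=$(printf '%s\n' '"736493","83740940"' | split_demo)
plain=$(printf '%s\n' '736493,83740940' | split_demo)

printf 'quoted line:\n%s\n' "$quoted"
printf 'plain line:\n%s\n' "$plain"
```

The quoted line splits into four fields, with $1 and $4 empty and the two numbers in $2 and $3. The unquoted line splits into two fields, with the first number in $1. This is why the script reads $1 from file1 but $2 from file2.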