Question

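I have a series of .csv files containing space-separated columnar data (5 columns). The file names have the format "yyyymmdd.csv". The files look like this, for example:

Contents of 20161201.csv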

key value more columns (this line (header) is absent)
123456 10000 some value
123457 20000 some value
123458 30000 some value

Contents of 20161202.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 80000 some value
123458 30000 some value

Contents of 20161203.csv

key value more columns (this line (header) is absent)
123456 50000 some value
123457 70000 some value
123458 30000 some value

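I want to compare the file with date "D" against the file with date "D + 1", based on the value column. I am then interested in the two consecutive files that have the largest number of mismatched rows. So here, if I compare 20161201.csv with 20161202.csv, only the second row does not match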

(123457 20000 some value and 123457 80000 some value, mismatched because of 20000 != 80000)

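Then if I compare 20161202.csv with 20161203.csv, I get 2 mismatched rows (row 1 and row 2).

So 20161202.csv and 20161203.csv are my target files.

I am looking for a series of bash commands that can do this.

PS: The number of lines in each file is large (about 3000), and you can assume that all files have the same year and month (number of files < 30).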

Answer 1

Without checking that the file names actually follow the date rule (file for date D vs. file for date D+1), you can do it like this:

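while IFS= read -r -d '' fn;do files+=("$fn");done < <(find . -name '201612*.csv' -print0 | sort -z)
#Load all filenames into an array, sorted by name so that consecutive dates are adjacent
#(find does not guarantee any order; sort -z is the GNU option for null-separated input).
#Using null separation we ensure that filenames are handled correctly even if they
#contain spaces or other special chars.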

max=0
for ((i=0;i<"${#files[@]}"-1;i++));do #iterate through the filenames array
  a="${files[i]}";b="${files[i+1]}" #compare file1 with file2, file2 with file3, etc - in series
  differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)
  echo "comparing $a vs $b - non matching lines=$differences" #Just for testing - can be removed .
  [[ "$max" -lt "$differences" ]] && max="$differences" && ahold="$a" && bhold="$b" #When we have the max differences we keep the names of the files
done

echo "max differences found=$max between $ahold and $bhold" #reporting max differences and in which files found

The core of finding the non-matching lines between two files is grep. You can try the grep by hand to check that the results are correct:

grep -v -F -w -f <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2)

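grep options:
-v: return the non-matching lines (inverts grep's normal behavior)
-F: fixed-string matching, not regex
-w: whole-word matching, so that 5000 does not match 50000
-f: load the patterns from a file, here field 2 of file1. With these patterns we search field 2 of file2.
wc -l: counts the lines grep returns, i.e. the non-matching lines
<(cut -d' ' -f2 file2): we grep only field 2 of file2 rather than the whole file, so that a field-2 value from file1 cannot match in one of file2's other columns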
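To see why -w matters, here is a minimal sketch on two made-up three-line files (the data and the temporary directory are assumptions for illustration, not the asker's files). Without -w the pattern 5000 matches inside 50000, so one changed value goes unnoticed:

```shell
# Scratch directory with two tiny sample files (hypothetical data).
tmp=$(mktemp -d)
printf '%s\n' '1 5000 x'  '2 20000 x' '3 30000 x' > "$tmp/file1"
printf '%s\n' '1 50000 x' '2 20000 x' '3 70000 x' > "$tmp/file2"

# Without -w: the pattern 5000 matches inside 50000, so only 70000 is reported.
without_w=$(grep -v -F -f <(cut -d' ' -f2 "$tmp/file1") <(cut -d' ' -f2 "$tmp/file2") | wc -l)

# With -w: 50000 is no longer matched by 5000, so both changed values are reported.
with_w=$(grep -v -F -w -f <(cut -d' ' -f2 "$tmp/file1") <(cut -d' ' -f2 "$tmp/file2") | wc -l)

echo "without -w: $without_w, with -w: $with_w"
rm -rf "$tmp"
```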

Alternative solution using awk

Instead of grep, you can use awk like this:

awk 'NR==FNR{a[$2];next}!($2 in a)' file1 file2

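This selects the same lines as the grep -v above (awk prints each whole line, whereas the grep pipeline prints only field 2).

Field 2 ($2) of file1 is loaded into the array a; the lines of file2 whose field 2 ($2) is not in this array (the non-matching ones) are then printed.

You can also pipe the output to |wc -l to count the non-matching lines, just as with grep.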
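As a small variation on the same idea, the counting can also be done inside awk instead of piping to wc -l. The sketch below recreates 20161202.csv and 20161203.csv from the question in a temporary directory (the directory and the variable names are only for illustration):

```shell
tmp=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$tmp/20161202.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' '123458 30000 some value' > "$tmp/20161203.csv"

# First file: remember every field-2 value as an array index.
# Second file: count the lines whose field 2 was never seen.
# n+0 makes awk print 0 instead of an empty string when nothing differs.
count=$(awk 'NR==FNR{a[$2];next} !($2 in a){n++} END{print n+0}' \
        "$tmp/20161202.csv" "$tmp/20161203.csv")

echo "non matching lines=$count"
rm -rf "$tmp"
```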

So, if you prefer awk, this line:

differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)

has to be changed to:

differences=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$a" "$b" |wc -l)

In any case, it seems you need an array to hold the file names, and then a loop to iterate over the files and compare them in pairs.
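Putting the pieces together, a compact sketch of the array-plus-loop approach might look like this. It assumes the files sit in one directory so that a plain glob (which bash expands in sorted order) can replace find; the helper name count_new_values and the sample directory are hypothetical:

```shell
# count_new_values FILE1 FILE2: number of lines in FILE2 whose field 2 does not occur in FILE1
count_new_values() {
  awk 'NR==FNR{a[$2];next} !($2 in a){n++} END{print n+0}' "$1" "$2"
}

# Recreate the three sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 20000 some value' '123458 30000 some value' > "$tmp/20161201.csv"
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$tmp/20161202.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' '123458 30000 some value' > "$tmp/20161203.csv"

files=("$tmp"/201612*.csv)        # glob results are sorted, so the dates are in order
max=0; ahold=; bhold=
for ((i=0; i<${#files[@]}-1; i++)); do
  d=$(count_new_values "${files[i]}" "${files[i+1]}")
  if (( d > max )); then          # remember the pair with the most differences so far
    max=$d; ahold=${files[i]}; bhold=${files[i+1]}
  fi
done

echo "max differences found=$max between ${ahold##*/} and ${bhold##*/}"
rm -rf "$tmp"
```

Like the loop above, this only pairs neighbouring entries of the array; it does not verify that the two dates are really one day apart.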

Answer 2

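Well, this was something of a challenge to implement.

With the code below, which is based purely on awk (GNU awk, actually), all we need is a starting point, the first file (file1). awk then derives the next file (file2) automatically by adding 1 day and compares the two files for differing lines.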
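The "add 1 day" step is just arithmetic on the last two digits of the file name. For comparison, a bash sketch of the same idea could look like the following; next_file is a hypothetical helper that is not part of this answer's awk code, and it assumes, as the question allows, that all files belong to the same month (it does not roll over at the end of a month):

```shell
# next_file NAME [DAYS]: turn yyyymmdd.csv into the name DAYS days later (default 1).
# Assumption: the result stays within the same month.
next_file() {
  local name=$1 days=${2:-1}
  local prefix=${name:0:6} day=${name:6:2}
  # 10# forces base 10, so that days "08" and "09" are not rejected as invalid octal numbers
  printf '%s%02d.csv\n' "$prefix" $(( 10#$day + days ))
}

next_file 20161201.csv      # prints 20161202.csv
next_file 20161208.csv 2    # prints 20161210.csv
```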

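If a file is missing from the chain, the script re-adjusts the names of file1 and file2 so that it keeps checking adjacent files for differing lines while respecting the +1 day rule.

You should normally be able to run the script even by copy-paste (it works in my bash even with the comments included), or you can save the code in a separate file (e.g. test.awk) that awk loads with the -f switch (awk -f test.awk).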

awk -v file1="20161201.csv" \
'function incfile(file,days)                                        #function receives two arguments: file and days
    {
    match(file,/(....)(..)(..)/,fn);                                #splits the string of file to format fn[1]=YYYY,fn[2]=MM and fn[3]=DD
    newfile=sprintf("%s%s%02d%s",fn[1],fn[2],fn[3]+days,".csv");    #this function increase the filename by days variable
    return (newfile)                                                #i.e file 20161201.csv returns 20161201+days
    };
BEGIN \
{
    chkdays=1; 
    while (chkdays<=15)
    {
        {
        file2=incfile(file1,1);                                     #Built filename of file2 by increasing file1 +1 day
        if ((getline < file2) < 0)                                  #Check if file2 exists (getline returns -1 when the file cannot be read)
            {
            print file1,"vs",file2,"skipped:",file2 "  not found";  #Print a help message - can be removed
            chkdays=chkdays+2;                                      #increase days counter for the while loop by 2
            file1=incfile(file1,2);                                 #Increase filename of file1 by 2 days (20161201 will be 20161203)
            file2=incfile(file2,2);                                 #The same for filename of file2 (20161202 will be 20161204)
            }
        else                                                        #if file2 exists
            {
            close(file2);                                           
            print "comparing",file1,"vs",file2; 
            while ((getline var < file1) > 0)                       #read a line from file1 into var; stop at end of file (0) or on error (-1)
                {split(var,ff1,OFS);a[ff1[2]]};                     #split the line from file1 (var) into fields, and keep field2 in an array as index
            while ((getline var2 < file2) > 0)
                {
                split(var2,ff2,OFS);                                #same for file2.split the line read (var2) 
                if (!(ff2[2] in a)) {print ">",var2;l=l+1};         #check if ff2[2] (file2-field2) is not found on the array created by file1-field2
                }
            if (l>maxd) {maxd=l;maxp=file1 " vs " file2};           #hold/save max different lines found and hold also the files that maxd was found
            file1=file2;                                            #Assign file2 to be file1 in order to repeat the loop
            chkdays=chkdays+1;                                      #Increase check days counter by 1
            delete a;l=0;close(file1);close(file2)                  #unset all necessary vars and close files
            }
        }
    };                                                              #End of the while loop
    print "max different lines=",maxd,"found at pair:",maxp         #Print the results
}'                                                                  #Finished

Output:

comparing 20161201.csv vs 20161202.csv
> 123457 80000 some value
comparing 20161202.csv vs 20161203.csv
> 123456 50000 some value
> 123457 70000 some value
20161203.csv vs 20161204.csv skipped: 20161204.csv  not found
20161205.csv vs 20161206.csv skipped: 20161206.csv  not found
20161207.csv vs 20161208.csv skipped: 20161208.csv  not found
20161209.csv vs 20161210.csv skipped: 20161210.csv  not found
comparing 20161211.csv vs 20161212.csv
> 123457 80000 some value
> 123458 15000 some value
> 123458 16000 some value
> 123458 17000 some value
comparing 20161212.csv vs 20161213.csv
> 123456 50000 some value
> 123457 70000 some value
> 123458 20000 some value
> 123458 25000 some value
> 123458 35000 some value
20161213.csv vs 20161214.csv skipped: 20161214.csv  not found
comparing 20161215.csv vs 20161216.csv
max different lines= 5 found at pair: 20161212.csv vs 20161213.csv

$ cat 20161212.csv
123456 10000 some value
123457 80000 some value
123458 30000 some value
123458 15000 some value
123458 16000 some value
123458 17000 some value

$ cat 20161213.csv
123456 50000 some value
123457 70000 some value
123458 20000 some value
123458 15000 some value
123458 25000 some value
123458 35000 some value

# csv files 01,02,03 are copy paste from your OP. file 11 is a copy of file 01.

PS: You can remove all the print statements from the awk code and keep only the final summary print.

I hope this code helps and works well for you.
