Compare two CSV files and show only the differences

Time: 2013-08-31 19:44:01

Tags: perl shell

I have two CSV files:

File1.csv

Time, Object_Name, Carrier_Name, Frequency, Longname

2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal

2013-08-05 00:00, Alpha, Aircel, 915.13, Aircel_Indore

File2.csv

Time, Object_Name, Carrier_Name, Frequency, Longname

2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal

2013-08-05 00:00, Alpha, Aircel, 815.13, Aircel_Indore

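These are samples of the actual input files. The real files contain many more headers and values, so I cannot hard-code them.

In my expected output, I want to keep the first two columns and the last column as they are, since they never change, and compare the remaining columns and values.

Expected output: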

Time, Object_Name, Frequency, Longname

2013-08-05 00:00, 815.13, Aircel_Indore

How can I do this?

4 answers:

Answer 0: (score: 0)

Answer 1: (score: 0)

If you are not tied to Perl, here is a solution using AWK:

 #!/bin/bash

 awk -v FS="," '

 function filter_columns()
 {
     return sprintf("%s, %s, %s, %s", $1, $2, $(NF-1), $NF);
 }

 NF !=0 && NR == FNR {
    if (NR == 1) {
            print filter_columns();
    } else {
            memory[line++] = filter_columns();
    }
 } NF != 0 && NR != FNR {
    if (FNR == 1) {
            line = 0;
    } else {
            new_line = filter_columns();
            if (new_line != memory[line++]) {
                    print new_line;
            }
    }
 }' File1.csv File2.csv

Output:

Time,  Object_Name,  Frequency,  Longname
2013-08-05 00:00,  Alpha,  815.13,  Aircel_Indore

Here is the explanation:

#!/bin/bash

# FS = "," makes awk split each line in fields using
# the comma as separator
awk -v FS="," '

# This function selects the columns you want. NF is the
# number of fields, so $NF is the content of the last
# column and $(NF-1) that of the second-to-last one.
function filter_columns()
{
     return sprintf("%s, %s, %s, %s", $1, $2, $(NF-1), $NF);
}

# This block processes just the first file; that is the purpose
# of the condition NR == FNR. The condition NF != 0 skips the
# empty lines you have in your file. The block prints the header
# and then saves all the other lines in the array memory.
NF !=0 && NR == FNR {
    if (NR == 1) {
            print filter_columns();
    } else {
            memory[line++] = filter_columns();
    }
}
# This block processes just the second file (NR != FNR).
# Since the header has already been printed, it skips the first
# line of the second file (FNR == 1). The block compares each line
# against the one saved in the array memory (the corresponding
# line in the first file) and prints only the lines
# that do not match.
NF != 0 && NR != FNR {
    if (FNR == 1) {
            line = 0;
    } else {
            new_line = filter_columns();
            if (new_line != memory[line++]) {
                    print new_line;
            }
    }
}' File1.csv File2.csv
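
The doubled spaces in the output above come from splitting on a bare comma, which leaves the blank after each comma inside the field. A minimal sketch of one way around that, assuming every separator is a comma optionally surrounded by spaces and that the header is the first line of each file: let the field separator be the regular expression ` *, *`. The function name compare_trimmed and the temporary file locations are made up for this illustration; the sample rows are copied from the question.

```shell
#!/bin/bash
# Sketch only: same line-by-line comparison as above, but the separator
# regex also swallows the blanks around each comma.

# Sample data from the question, written to a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/File1.csv" <<'EOF'
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 915.13, Aircel_Indore
EOF
cat > "$tmpdir/File2.csv" <<'EOF'
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 815.13, Aircel_Indore
EOF

# compare_trimmed FILE1 FILE2 (hypothetical helper name)
compare_trimmed() {
    awk -F ' *, *' '
    # Keep the first two columns and the last two, as in the answer above.
    function filter_columns() {
        return $1 ", " $2 ", " $(NF-1) ", " $NF
    }
    NF == 0 { next }                       # skip empty lines
    NR == FNR {                            # first file
        if (FNR == 1) print filter_columns()   # header, printed once
        else memory[n1++] = filter_columns()   # remember each data row
        next
    }
    FNR == 1 { next }                      # second file: skip its header
    {                                      # second file: data rows
        new_line = filter_columns()
        if (new_line != memory[n2++]) print new_line
    }' "$1" "$2"
}

compare_trimmed "$tmpdir/File1.csv" "$tmpdir/File2.csv"
```

With the sample files this prints the header `Time, Object_Name, Frequency, Longname` followed by `2013-08-05 00:00, Alpha, 815.13, Aircel_Indore`, each field separated by a single space after the comma. Like the original, it still hard-codes which columns are kept and does not handle quoted fields that contain commas.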

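Answer 2: (score: 0)

Answers to @IlmariKaronen's questions would clarify the problem better, but in the meantime I made some assumptions and took a crack at it, mainly because I needed an excuse to learn a bit of Text::CSV.

Here is the code: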

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV;
use Array::Compare;
use feature 'say';

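# Stop early with a clear message if a file cannot be opened.
open my $in_file, '<', 'infile.csv' or die "Cannot open infile.csv: $!";
open my $exp_file, '<', 'expectedfile.csv' or die "Cannot open expectedfile.csv: $!";

open my $out_diff_file, '>', 'differences.csv' or die "Cannot open differences.csv: $!";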

my $text_csv = Text::CSV->new({ allow_whitespace => 1, auto_diag => 1 });

my $line = readline($in_file);
my $exp_line = readline($exp_file);
die 'Different column headers' unless $line eq $exp_line;
$text_csv->parse($line);
my @headers = $text_csv->fields();

my %all_differing_indices;

# array of arrays holding the "expected" rows for the lines that differ;
# only columns 0, 1, the last one and those that differ from the input have values, others are empty
my @all_differing_rows; 

my $array_comparer = Array::Compare->new(DefFull => 1);
while (defined($line = readline($in_file))) {
    $exp_line = readline($exp_file);
    if ($line ne $exp_line) {
        $text_csv->parse($line);
        my @in_fields = $text_csv->fields();
        $text_csv->parse($exp_line);
        my @exp_fields = $text_csv->fields();

        my @differing_indices = $array_comparer->compare([@in_fields], [@exp_fields]);
        @all_differing_indices{@differing_indices} = (1) x scalar(@differing_indices);
        my @output_row = ('') x scalar(@exp_fields);
        @output_row[0, 1, @differing_indices, $#exp_fields] = @exp_fields[0, 1, @differing_indices, $#exp_fields];
        push @all_differing_rows, [@output_row];
    }
}

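# Sort the differing indices numerically: keys() returns them in no particular order.
my @columns_needed = (0, 1, (sort { $a <=> $b } keys(%all_differing_indices)), $#headers);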

$text_csv->combine(@headers[@columns_needed]);
say $out_diff_file $text_csv->string();
for my $row_aref (@all_differing_rows) {
    $text_csv->combine(@{$row_aref}[@columns_needed]);   
    say $out_diff_file $text_csv->string();
}

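It works with File1 and File2 as given in the question and produces the expected output (except that the Object_Name 'Alpha' appears in the data row; I assume its absence there is a typo in the question).

Time,Object_Name,Frequency,Longname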
"2013-08-05 00:00",Alpha,815.13,Aircel_Indore
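
For readers who would rather stay in the shell, the idea behind the Perl script (collect the union of the column indices that differ anywhere, then print the first two columns, those columns, and the last column for every differing row) can be sketched in awk. This is only an illustration under several assumptions: both files have the same layout with at least three columns, rows line up by line number, fields contain no quoted commas, and the function name diff_columns is made up here. Unlike the Perl version it prints the second file's value in every selected column rather than leaving unchanged cells empty, and it does no CSV quoting.

```shell
#!/bin/bash
# Sketch only: find which middle columns differ, without hard-coding them.

# Sample data from the question, written to a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/File1.csv" <<'EOF'
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 915.13, Aircel_Indore
EOF
cat > "$tmpdir/File2.csv" <<'EOF'
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 815.13, Aircel_Indore
EOF

# diff_columns FILE1 FILE2 (hypothetical helper name)
diff_columns() {
    awk -F ' *, *' '
    NF == 0 { next }                        # skip empty lines
    NR == FNR {                             # first file: remember every cell
        for (i = 1; i <= NF; i++) old[FNR, i] = $i
        next
    }
    FNR == 1 {                              # header of the second file
        ncols = NF
        for (i = 1; i <= NF; i++) header[i] = $i
        next
    }
    {                                       # data rows of the second file
        row_differs = 0
        for (i = 3; i < NF; i++)            # only the middle columns are compared
            if (($i "") != old[FNR, i]) {   # appending "" forces a string comparison
                differs[i] = 1
                row_differs = 1
            }
        if (row_differs) {
            nrows++
            for (i = 1; i <= NF; i++) cell[nrows, i] = $i
        }
    }
    END {
        # Columns to print: 1, 2, each differing middle column in order, last.
        keep[++k] = 1
        keep[++k] = 2
        for (i = 3; i < ncols; i++) if (i in differs) keep[++k] = i
        keep[++k] = ncols
        for (j = 1; j <= k; j++)
            printf "%s%s", header[keep[j]], (j < k ? "," : "\n")
        for (r = 1; r <= nrows; r++)
            for (j = 1; j <= k; j++)
                printf "%s%s", cell[r, keep[j]], (j < k ? "," : "\n")
    }' "$1" "$2"
}

diff_columns "$tmpdir/File1.csv" "$tmpdir/File2.csv"
```

For the sample files, only the Frequency column differs, so the sketch prints `Time,Object_Name,Frequency,Longname` and then `2013-08-05 00:00,Alpha,815.13,Aircel_Indore`. For real-world CSV with quoting, a proper parser such as Text::CSV remains the safer choice.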

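Answer 3: (score: 0)

I created a script for this using some very powerful Linux tools. Link here...

Linux / Unix - Compare two CSV files

This project is about comparing two CSV files.

Let's assume csvFile1.csv has XX columns and csvFile2.csv has YY columns.

The script I wrote compares one (key) column from csvFile1.csv with another (key) column from csvFile2.csv. Each value from csvFile1.csv (one row of the key column) is compared with each value from csvFile2.csv.

If csvFile1.csv has 1,500 rows and csvFile2.csv has 15,000, the total number of combinations (comparisons) is 22,500,000. So this is a useful way to build an availability report script that, for example, compares an internal product database with an external (supplier) product database.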
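
As a rough illustration of a key-column comparison, here is a minimal awk sketch. It is not the script from the blog and it uses a different technique: instead of comparing every key against every other key, it loads the key column of the second file into an awk array and looks each key of the first file up in it, so each file is read only once. The column contents, the function name report_availability and the "found"/"missing" labels are all invented for this example, and the sketch assumes plain comma-separated files with a one-line header and no quoted fields.

```shell
#!/bin/bash
# Sketch only: check which keys of one CSV also occur in another CSV.

# Made-up sample data (not from the original post).
tmpdir=$(mktemp -d)
cat > "$tmpdir/csvFile1.csv" <<'EOF'
sku,name
A100,widget
B200,gadget
C300,gizmo
EOF
cat > "$tmpdir/csvFile2.csv" <<'EOF'
supplier_sku,stock
B200,14
C300,0
D400,7
EOF

# report_availability FILE1 KEYCOL1 FILE2 KEYCOL2 (hypothetical helper name)
# KEYCOL1 and KEYCOL2 are 1-based column numbers of the key columns.
report_availability() {
    awk -F ',' -v k1="$2" -v k2="$4" '
    # FILE2 is passed to awk first: collect its keys, skipping the header.
    NR == FNR { if (FNR > 1) seen[$k2] = 1; next }
    # FILE1 is read second: report whether each key exists in FILE2.
    FNR > 1   { print $k1 "," ((($k1) in seen) ? "found" : "missing") }
    ' "$3" "$1"
}

report_availability "$tmpdir/csvFile1.csv" 1 "$tmpdir/csvFile2.csv" 1
```

With the made-up data this reports A100 as missing and B200 and C300 as found. The array lookup replaces the pairwise comparisons, which matters once the files reach the sizes mentioned above.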

Packages used: csvcut (cut columns), csvdiff (compare two CSV files), ssconvert (convert xlsx to csv), iconv, curlftpfs, zip, unzip, ntpd, proftpd

You can find more (plus example scripts) on my official blog: http://damian1baran.blogspot.sk/2014/01/linux-unix-compare-two-csv-files.html