Two CSV files: change one using the other and pull out the matching lines

Time: 2015-08-14 12:09:22

Tags: perl csv awk grep

I have two CSV files. The first is a list file containing IDs and names. For example

1127100,Acanthocolla cruciata  
1127103,Acanthocyrta haeckeli  
1127108,Acanthometra fusca 

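The second is the file whose contents I want to swap: wherever the leading number has a match in the list, I want to replace it with the name and extract that line. The first-column numbers correspond between the two files. For example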

1127108,1,0.60042  
1127103,1,0.819671  
1127100,2,0.50421,0.527007  
10207,3,0.530422,0.624466   

So I want to end up with a CSV file like this

Acanthometra fusca,1,0.60042  
Acanthocyrta haeckeli,1,0.819671  
Acanthocolla cruciata,2,0.50421,0.527007

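I tried Perl, but opening two files at once proved messy. So I tried reading one of the CSV files into a string and parsing it that way, but that did not work. I have since been reading about grep and other one-liners, but I am not familiar with them. Would this be possible with grep?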

Here is the Perl code I tried

use strict;
use warnings;

 open my $csv_score, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
 open my $csv_list,  '<', "$ARGV[1]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
 open my $out, ">$ARGV[0]_final.txt" or die qq{Failed to open for output: $!\n};

  my $string = <$csv_score>;

  while ( <$csv_list> ) {

    my ($find, $replace) = split /,/; 
    $string =~ s/$find/$replace/g;

         if ($string =~ m/^$replace/){
         print $out $string;
      }
  }

close $csv_score;
close $csv_list;
close $out;

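3 answers:

Answer 0: (score: 2)

Your code fails because you read only the first line of the $csv_score file, and you try to print $string every time it changes. You also fail to remove the newline from the end of each line of the $csv_list file. If you fix those things, it looks like this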

use strict;
use warnings;

open my $csv_score, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open my $csv_list, '<', "$ARGV[1]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open my $out, ">$ARGV[0]_final.txt" or die qq{Failed to open for output: $!\n};

my $string = do {
    local $/;
    <$csv_score>;
};

while ( <$csv_list> ) {
    chomp;

    my ( $find, $replace ) = split /,/;
    $string =~ s/$find/$replace/g;
}

print $out $string;

close $csv_score;
close $csv_list;
close $out;

Output

Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
10207,3,0.530422,0.624466

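However, this is not a safe way of doing things, because an ID may be found somewhere other than at the beginning of a line

I would build a hash from the $csv_list file like this, which also makes the program more concise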
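To make that risk concrete, here is a minimal sketch. The sample line and the helper name `replace_leading_id` are made up for illustration and are not part of the original answer. It shows an unanchored substitution touching a later column, and an anchored pattern (with `\Q...\E` to quote the ID) leaving it alone.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $find only when it is the entire first field.
sub replace_leading_id {
    my ( $line, $find, $replace ) = @_;
    # ^ anchors at the start of the line; (?=,) requires a comma right after the ID.
    $line =~ s/^\Q$find\E(?=,)/$replace/;
    return $line;
}

# Made-up line in which the ID's digits also occur in a later column.
my $line = "1127100,2,0.1127100";

# Unanchored, global substitution: every occurrence is replaced.
( my $unanchored = $line ) =~ s/1127100/Acanthocolla cruciata/g;

# Anchored substitution: only the leading ID is replaced.
my $anchored = replace_leading_id( $line, '1127100', 'Acanthocolla cruciata' );

print "$unanchored\n";
print "$anchored\n";
```

The same anchoring also stops a short ID from matching inside a longer one, since the pattern must be followed directly by a comma.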

use strict;
use warnings;
use v5.10.1;
use autodie;

my %ids;
{
    open my $fh, '<', $ARGV[1];
    while ( <$fh> ) {
        chomp;
        my ($id, $name) = split /,/;
        $ids{$id} = $name;
    }
}

open my $in_fh,  '<',  $ARGV[0];
open my $out_fh, '>', "$ARGV[0]_final.txt";

while ( <$in_fh> ) {
    s{^(\d+)}{$ids{$1} // $1}e;
    print $out_fh $_;
}

The output is identical to that of the first program above

Answer 1: (score: 2)

The problem with your code is that you do this only once:
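The question's desired output leaves out the line whose ID (10207) is not in the list. As a rough sketch of one way to get that with the same hash idea, the loop can skip lines with unknown IDs. The data is held in strings here only to keep the example self-contained; the variable names are illustrative.

```perl
use strict;
use warnings;

# Sample data from the question, kept in strings so the sketch runs on its own.
my $list_data  = "1127100,Acanthocolla cruciata\n"
               . "1127103,Acanthocyrta haeckeli\n"
               . "1127108,Acanthometra fusca\n";
my $score_data = "1127108,1,0.60042\n"
               . "1127103,1,0.819671\n"
               . "1127100,2,0.50421,0.527007\n"
               . "10207,3,0.530422,0.624466\n";

# Build the ID => name lookup from the list data.
my %ids;
open my $list_fh, '<', \$list_data or die "Cannot open list data: $!";
while ( my $line = <$list_fh> ) {
    chomp $line;
    my ( $id, $name ) = split /,/, $line, 2;
    $ids{$id} = $name;
}
close $list_fh;

# Keep only lines whose leading ID is known, replacing the ID with its name.
my @kept;
open my $score_fh, '<', \$score_data or die "Cannot open score data: $!";
while ( my $line = <$score_fh> ) {
    my ($id) = $line =~ /^(\d+),/ or next;   # skip lines without a leading numeric ID
    next unless exists $ids{$id};            # skip IDs that are not in the list
    $line =~ s/^\d+/$ids{$id}/;
    push @kept, $line;
}
close $score_fh;

print @kept;
```

With real files, the two in-memory handles would simply be opened on `$ARGV[1]` and `$ARGV[0]` instead.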

my $string = <$csv_score>;

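This reads a single line from $csv_score, and you never use the rest of the file.

I would suggest you need to:

  • Read the first file into a hash
  • Iterate over the second file, performing a replacement on the first column.
  • Using Text::CSV is generally a good idea for this kind of processing, though for your example it does not seem strictly necessary

So: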
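As a small illustration of why only one line arrives, the sketch below reads from an in-memory string (an assumption made to keep it self-contained) and contrasts a scalar-context read, a list-context read, and a whole-file read with `$/` undefined.

```perl
use strict;
use warnings;

my $data = "1127108,1,0.60042\n"
         . "1127103,1,0.819671\n"
         . "1127100,2,0.50421,0.527007\n";

# Scalar context: the read returns just the next line.
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";
my $first_line = <$fh>;

# List context: the read returns every remaining line.
my @remaining = <$fh>;
close $fh;

# With $/ locally undefined, a single read returns the whole file.
open $fh, '<', \$data or die "Cannot open in-memory data: $!";
my $whole = do { local $/; <$fh> };
close $fh;

printf "first line: %s", $first_line;
printf "remaining lines: %d\n", scalar @remaining;
printf "slurped length: %d\n", length $whole;
```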

#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;

my $csv = Text::CSV->new( { binary => 1 } );

my %replace;

while ( my $row = $csv->getline( \*DATA ) ) {
    last if $row->[0] =~ m/NEXT/;
    $replace{ $row->[0] } = $row->[1];
}

print Dumper \%replace;

my $search = join( "|", map {quotemeta} keys %replace );
$search = qr/$search/;    # compile the alternation once (assignment, not a match)

while ( my $row = $csv->getline( \*DATA ) ) {
    $row->[0] =~ s/^($search)$/$replace{$1}/;
    $csv->print( \*STDOUT, $row );
    print "\n";
}

__DATA__
1127100,Acanthocolla cruciata  
1127103,Acanthocyrta haeckeli  
1127108,Acanthometra fusca 
NEXT
1127108,1,0.60042  
1127103,1,0.819671  
1127100,2,0.50421,0.527007  
10207,3,0.530422,0.624466 

Note - this will still print the last line of your source content:

"Acanthometra fusca ",1,"0.60042  "
"Acanthocyrta haeckeli  ",1,"0.819671  "
"Acanthocolla cruciata  ",2,0.50421,"0.527007  "

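(Your data contains whitespace, so Text::CSV wraps those fields in quotes)

If you want to discard the unmatched line, you can test whether the substitution actually took place: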
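If the surrounding whitespace is not meaningful (an assumption about your data), one option is to trim each field before printing, so that nothing needs quoting. A minimal sketch follows; `trim_fields` is a hypothetical helper, not part of Text::CSV, and it works on a row array reference of the kind getline returns.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace from every field
# of a row array reference, in place, and return the same reference.
sub trim_fields {
    my ($row) = @_;
    s/^\s+|\s+$//g for @$row;
    return $row;
}

# A row shaped like the one parsed from the sample data.
my $row = trim_fields( [ 'Acanthometra fusca ', '1', '0.60042  ' ] );

print join( ',', @$row ), "\n";
```

In the program above, it would be called on `$row` just before `$csv->print`.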

if ( $row->[0] =~ s/^($search)$/$replace{$1}/ ) {
    $csv->print( \*STDOUT, $row );
    print "\n";
}

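(Of course, you can keep using split /,/ if you are sure your data contains none of the awkward things that CSV normally supports.)

Answer 2: (score: 2)

I would like to offer a very different approach.

Let's say you are more comfortable with databases than with Perl's data structures. You can use DBD::CSV to turn your CSV files into a kind of relational database. It uses Text::CSV under the hood (hat tip to @Sobrique). You will need to install it from CPAN, as it is not bundled with the default DBI distribution.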

use strict;
use warnings;
use Data::Printer; # for p
use DBI;

my $dbh = DBI->connect( "dbi:CSV:", undef, undef, { f_ext => '.csv' } );
$dbh->{csv_tables}->{names}   = { col_names => [qw/id name/] };
$dbh->{csv_tables}->{numbers} = { col_names => [qw/id int float/] };

my $sth_select = $dbh->prepare(<<'SQL');
SELECT names.name, numbers.int, numbers.float
FROM names
JOIN numbers ON names.id = numbers.id
SQL

# column types will be silently discarded
$dbh->do('CREATE TABLE result ( name CHAR(255), int INTEGER, float INTEGER )');
my $sth_insert = 
  $dbh->prepare('INSERT INTO result ( name, int, float ) VALUES ( ?, ?, ? ) ');

$sth_select->execute;
while (my @res = $sth_select->fetchrow_array ) {
  p @res;
  $sth_insert->execute(@res);
}

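What this does is set up column names for the two tables (your CSV files), because they do not have a first line with names. I made up the names based on the data types. It then creates a new table (CSV file) named result and fills it by writing one row at a time.

At the same time, it outputs the data (for debugging purposes) to STDERR using Data::Printer.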

[
    [0] "Acanthocolla cruciata",
    [1] 2,
    [2] 0.50421
]
[
    [0] "Acanthocyrta haeckeli",
    [1] 1,
    [2] 0.819671
]
[
    [0] "Acanthometra fusca",
    [1] 1,
    [2] 0.60042
]

The resulting file looks like this:

$ cat scratch/result.csv 
name,int,float
"Acanthocolla cruciata",2,0.50421
"Acanthocyrta haeckeli",1,0.819671
"Acanthometra fusca",1,0.60042