Question

I have a large CSV file whose lines have the following format:

c1,c2

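I want to split the original file into two files, as follows:

One file will contain the lines whose c1 value appears exactly once in the file.
The other file will contain the lines whose c1 value appears two or more times in the file.

Any idea how to do this?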

For example, if the original file is:

1,foo
2,bar
3,foo
4,bar
2,foo
1,bar

I want to produce the following files:

3,foo
4,bar

and

1,foo
2,bar
2,foo
1,bar

Answer 1

This one-liner generates two files, o1.csv and o2.csv:

awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' file file

Test:

kent$  cat f
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar

kent$  awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' f f

kent$  head o*
==> o1.csv <==
3,foo
4,bar

==> o2.csv <==
1,foo
2,bar
2,foo
1,bar

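Notes

awk reads the file twice rather than holding the whole file in memory; only the count for each distinct c1 value is kept.
The original order of the lines is preserved in both output files.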
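
For readers who prefer not to use awk, the same two-pass idea can be sketched in Python. This is only an illustration under stated assumptions: the function name `split_by_c1_count` and its path parameters are hypothetical, and c1 is taken to be everything before the first comma, with no quoted fields.

```python
from collections import Counter


def split_by_c1_count(src, unique_path, dupes_path):
    """Two-pass split: count c1 values, then route each line by its count.

    Hypothetical helper; assumes c1 is the text before the first comma.
    """
    # Pass 1: keep only one counter per distinct c1 value in memory.
    with open(src, newline="") as f:
        counts = Counter(line.split(",", 1)[0] for line in f)

    # Pass 2: re-read the file and copy each line to the matching output.
    # Lines are written in input order, so both outputs keep that order.
    with open(src, newline="") as f, \
            open(unique_path, "w", newline="") as uniq, \
            open(dupes_path, "w", newline="") as dupes:
        for line in f:
            key = line.split(",", 1)[0]
            (uniq if counts[key] == 1 else dupes).write(line)
```

Calling `split_by_c1_count("f", "o1.csv", "o2.csv")` would mirror the awk command above: memory use grows with the number of distinct c1 values, not with the number of lines.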

Answer 2

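Depending on what you mean, this may work for you. It has to hold each line in an associative array until it sees a second use of the same c1, or until the end of the file. When a second use is seen, the remembered data is changed to "!" to avoid printing it again on a third or later match. Lines with a repeated c1 go to file1 and lines with a unique c1 go to file2. Because a held first line is only printed once its second occurrence arrives, file1 does not keep the original line order.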

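>file2
awk -F, '
{
  if (done[$1] != "") {
    # c1 was seen before: print the held first line once, then mark it "!"
    if (done[$1] != "!") {
      print done[$1]
      done[$1] = "!"
    }
    print
  } else {
    # first time this c1 is seen: hold the line and remember the input order
    done[$1] = $0
    order[++n] = $1
  }
}
END {
  # lines still held (never marked "!") have a unique c1
  for (i = 1; i <= n; i++) {
    out = done[order[i]]
    if (out != "!") print out >> "file2"
  }
}
' <csvfile >file1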
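
The single-pass technique in this answer can also be sketched in Python, which may make the hold-then-flush logic easier to follow. The function name `split_single_pass` and its parameters are hypothetical, c1 is assumed to be the text before the first comma, and `None` is used as the "already flushed" marker in place of "!".

```python
def split_single_pass(lines, unique_out, dupes_out):
    """Single-pass split that holds the first line seen for each c1.

    Hypothetical helper: `lines` is any iterable of text lines, and the two
    outputs are writable text streams.
    """
    held = {}  # c1 -> first line seen, or None once c1 is known to repeat
    for line in lines:
        key = line.split(",", 1)[0]
        if key in held:
            if held[key] is not None:
                # Second occurrence: flush the held first line exactly once.
                dupes_out.write(held[key])
                held[key] = None
            dupes_out.write(line)
        else:
            held[key] = line
    # Anything still held was never repeated. Dicts keep insertion order
    # (Python 3.7+), so no separate order array is needed.
    for line in held.values():
        if line is not None:
            unique_out.write(line)
```

As with the awk version, every line whose c1 has not yet repeated stays in memory until the end of the input, so this trades memory for reading the input only once (useful when the data arrives on a pipe and cannot be re-read).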

Answer 3

I'd break out Perl for this job:

#!/usr/bin/env perl

use strict; 
use warnings;

my %count_of;
my @lines; 

open ( my $input, '<', 'your_file.csv' ) or die $!; 

#read the whole file
while ( <$input> ) {
   my ( $c1, $c2 ) = split /,/;
   $count_of{$c1}++; 
   push ( @lines, [ $c1 , $c2 ] ); 
}
close ( $input ); 

print "File 1:\n";
#filter any single elements
foreach my $pair ( grep { $count_of{$_ -> [0]} < 2 } @lines ) {
    print join (",", @$pair );
}

print "File 2:\n"; 
#filter any repeats. 
foreach my $pair ( grep { $count_of{$_ -> [0]} > 1 } @lines ) {
    print join (",", @$pair );
}

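This holds the whole file in memory, but depending on your data, you may not save much space by processing it twice and keeping only a count.

However, you could do: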

#!/usr/bin/env perl

use strict;
use warnings;

my %count_of;

open( my $input, '<', 'your_file.csv' ) or die $!;

#read the whole file counting "c1"
while (<$input>) {
    my ( $c1, $c2 ) = split /,/;
    $count_of{$c1}++;
}

open( my $output_single, '>', "output_uniques.csv" ) or die $!;
open( my $output_dupe,   '>', "output_dupes.csv" )   or die $!;

seek( $input, 0, 0 );
while ( my $line = <$input> ) {
    my ($c1) = split( ",", $line );
    if ( $count_of{$c1} > 1 ) {
        print {$output_dupe} $line;
    }
    else {
        print {$output_single} $line;
    }
}

close($input);
close($output_single);
close($output_dupe);

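This minimises the memory footprint by retaining only the counts: it first reads the file to count the c1 values, then processes it a second time and prints each line to the appropriate output.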

Split a large CSV file based on column value cardinality
