I am writing a script in Perl, but I am stuck. Below is a sample of my CSV file.
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"
"MUM","918652624178","20150806125005","","prepaid","","2G","NEW"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
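Now I need to get unique records based on the second column (the mobile number), keeping only the record with the latest value in the third column (the timestamp). For example, take the mobile number "918120197922":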
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
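It should select the third record, because it has the latest timestamp value (20150806125005). Please help.
Additional information: Sorry for the inconsistent data; I have corrected it now. Yes, the data is in order, which means the latest timestamp appears in the latest row. One more thing: my file is larger than 1 GB, so is there an efficient way to do this? Would awk be faster than Perl in this case? Please help.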
Answer 0 (score: 3)
Use Text::CSV to process the CSV file.
Hash the rows by column 2 and keep only the most recent row for each number.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;
my $csv = 'Text::CSV'->new() or die 'Text::CSV'->error_diag;
my %hash;
open my $CSV, '<', '1.csv' or die $!;
while (my $row = $csv->getline($CSV)) {
    my ($number, $timestamp) = @$row[1, 2];
    # Store the row if the timestamp is more recent than the stored one.
    $hash{$number} = $row if $timestamp gt ($hash{$number}[2] || q());
}
$csv->eol("\n");
$csv->always_quote(1);
open my $OUT, '>', 'uniq.csv' or die $!;
for my $row (values %hash) {
    $csv->print($OUT, $row);
}
close $OUT or die $!;
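The "keep the latest row per number" step can also be sketched without any CSV module, on rows that are already parsed. This is only a minimal sketch under assumptions: the helper name latest_per_key and the shortened sample rows are hypothetical, and it uses ge rather than the gt in the script above, so a later row wins a tie. It also remembers the order in which numbers were first seen, whereas values %hash returns rows in no particular order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for each key, keep the row with the greatest timestamp.
# $rows is an array ref of already-parsed rows (array refs), e.g. from Text::CSV.
sub latest_per_key {
    my ($rows, $key_idx, $ts_idx) = @_;
    my (%latest, @order);
    for my $row (@$rows) {
        my $key = $row->[$key_idx];
        if (!exists $latest{$key}) {
            push @order, $key;    # remember the first-seen order of keys
            $latest{$key} = $row;
        }
        elsif ($row->[$ts_idx] ge $latest{$key}[$ts_idx]) {
            # Fixed-width YYYYMMDDhhmmss strings compare correctly as strings;
            # "ge" (an assumption of this sketch) lets a later row win a tie.
            $latest{$key} = $row;
        }
    }
    return [ map { $latest{$_} } @order ];
}

# Shortened sample rows based on the question's data.
my @rows = (
    [ 'MP', '918120197922', '20150806125001', '3G' ],
    [ 'GJ', '919904303790', '20150806125002', '2G' ],
    [ 'MP', '918120197922', '20150806125004', '2G' ],
    [ 'MP', '918120197922', '20150806125005', '2G' ],
);

print join(',', @$_), "\n" for @{ latest_per_key(\@rows, 1, 2) };
```

Memory use grows with the number of distinct phone numbers, not with the number of lines, since only one row per number is kept.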
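Answer 1 (score: 0)
If you know that your data is sorted by timestamp, you can take advantage of that: read the file backwards, and the task becomes outputting the first occurrence of each phone number.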
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;
use constant PHONENUM_FIELD => 1;
my $filename = shift;
die "Usage: $0 <filename>\n" unless defined $filename;
open my $in, '-|', 'tac', $filename;
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );
my %seen;
while ( my $row = $csv->getline($in) ) {
    $csv->print( *STDOUT, $row ) unless $seen{ $row->[PHONENUM_FIELD] }++;
}
If you want the output in the same order as the input, you can also write to tac:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;
use constant PHONENUM_FIELD => 1;
my $filename = shift;
die "Usage: $0 <filename>\n" unless defined $filename;
open my $in, '-|', 'tac', $filename;
open my $out, '|-', 'tac';
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );
my %seen;
while ( my $row = $csv->getline($in) ) {
    $csv->print( $out, $row ) unless $seen{ $row->[PHONENUM_FIELD] }++;
}
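The idea behind both scripts (reverse, keep the first occurrence of each number, reverse again) can be illustrated on in-memory rows. This is only a rough sketch: the helper name last_occurrence_per_key is hypothetical, the sample rows are shortened, and it assumes the rows are already sorted by timestamp, oldest first. For a 1 GB file you would still stream through tac as above rather than load everything into memory.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: rows are assumed to be sorted oldest-to-newest.
sub last_occurrence_per_key {
    my ( $rows, $key_idx ) = @_;
    my %seen;
    # Walk newest-to-oldest and keep only the first row met for each key.
    my @kept = grep { !$seen{ $_->[$key_idx] }++ } reverse @$rows;
    # Reverse again to restore the original oldest-to-newest order.
    return [ reverse @kept ];
}

# Shortened sample rows based on the question's data.
my @rows = (
    [ 'MP',  '918120197922', '20150806125001' ],
    [ 'GJ',  '919904303790', '20150806125002' ],
    [ 'MH',  '919921990805', '20150806125003' ],
    [ 'MP',  '918120197922', '20150806125004' ],
    [ 'MUM', '919904303790', '20150806125005' ],
);

print join( ',', @$_ ), "\n" for @{ last_occurrence_per_key( \@rows, 1 ) };
```

No timestamp comparison is needed here, because the sort order already guarantees that the last row for a number is its latest one.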
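1 GB should not be a problem on any decent hardware. On my old notebook, processing 29360128 rows (1.8 GB) takes 2m3.393s. That is more than 230k rows/s, but YMMV. If you want all values quoted in the output, add always_quote => 1 to the $csv constructor arguments.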
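As a small illustration of that option, here is the same constructor as in the scripts above with always_quote => 1 added. The field values are just taken from the question's sample, and the sketch uses combine and string instead of print only so the resulting line can be inspected; it assumes Text::CSV_XS is installed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;

# Same constructor arguments as in the scripts above, plus always_quote.
my $csv = Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, eol => $/, always_quote => 1 } );

# combine() builds one CSV line from a list of fields; string() returns it.
$csv->combine( 'MP', '918120197922', '20150806125005', '2G' )
    or die "combine failed\n";
my $line = $csv->string;

print $line;
```

With this setting every field is wrapped in double quotes, which matches the fully quoted format of the input file.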