I'm trying to write a Perl script.
For example, I have a CSV file containing the following records:
"101","5555555555","DOE, JOHN "," DOE, JOHN, your trip
tomorrow from, 123 Anywhere St Apt #A, to, 100 ELSEWHERE RD APT E, is
scheduled for pickup between, 1:00 PM, and 1:30 PM"
"102","5555555555","DOE, JOHN "," DOE, JOHN, your trip
tomorrow from, 100 ELSEWHERE RD APT E, to, 123 Anywhere St Apt #A, is
scheduled for pickup between, 9:00 PM, and 9:30 PM"
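I want the script to read and parse the file, detect duplicate values in the second field ("5555555555"), and then create a new CSV file in which the records above are merged into a single record: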
"101","5555555555","DOE, JOHN "," DOE, JOHN, your trip
tomorrow from, 123 Anywhere St Apt #A, to, 100 ELSEWHERE RD APT E, is
scheduled for pickup between, 1:00 PM, and 1:30 PM AND your trip
tomorrow from, 100 ELSEWHERE RD APT E, to, 123 Anywhere St Apt #A, is
scheduled for pickup between, 9:00 PM, and 9:30 PM"
My current Perl code finds, reads, and parses the file successfully, but I'm lost on how to detect the duplicates and combine the rows.
#!
use strict;
use warnings;
use File::Find;
use Text::CSV;
$| = 1;
use constant {
    #Check for CSV files only
    SUFFIX_LIST => qr/\.(csv)$/,
    DIR_TO_CHECK => "/Users/Me/Desktop/INBOUND/",
};
my @file_list;
while (1) {
    #Recursively search the input directory for CSV files
    find ( sub {
        return unless -f;
        return unless $_ =~ SUFFIX_LIST;
        #Make sure all of the files in the file list array are unique
        if(!(grep(/^$_$/, @file_list))) {
            push @file_list, $File::Find::name;
        }
    }, DIR_TO_CHECK
    );
    #If .csv files are found...
    if (scalar(@file_list) > 0) {
        print "\nNew Item in Directory\n";
        parseFile($file_list[0]);
        #Delete input file
        unlink $file_list[0];
        print "Deleted File\n";
        #Remove the file from the file list
        shift @file_list;
    } else {
        print "No New Item\n";
    }
    sleep 5;
}
#Subroutine to parse and compare the csv file
sub parseFile() {
    my $csv = Text::CSV->new({ sep_char => ',',
                               always_quote => 1,
                               quote_char => '"',
                               escape_char => '"',
                               binary => 1,
                               auto_diag => 1});
    #Get the file that was passed to the function
    my $file = $_[0] or die "CSV file not passed in subroutine\n";
    #Open file for reading
    open(my $data, '<', $file) or die "Could not open '$file' $!\n";
    while (my $line = <$data>) {
        print $line;
        if ($csv->parse($line)) {
            my @fields = $csv->fields();
        } else {
            #warn "Line could not be parsed: $line\n";
            Text::CSV->error_input();
        }
    }
    close $data;
}
I think my approach is wrong somewhere, because I suspect I need to read the file into memory as a whole rather than line by line. Please help, thanks.
Answer 0 (score: 0)
I don't write Perl these days, but here is my answer. Create a hash table that uses the second field as the key, like this:
$hashtbl{555555} = {
    id    => 102,                        # first field
    names => "doe, john",                # third field
    msg   => "DOE, JOHN, your trip...",  # last field
};
If the key already exists in the hash table, append the new message to its msg:
if (exists $hashtbl{$KEY}) {
    $hashtbl{$KEY}{msg} .= " AND $last_field";
}
After the whole file has been read, use this hash table to create the new CSV file.
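To make the idea concrete, here is a minimal sketch of just the merging step. It assumes the records have already been parsed into array refs of the form [ id, phone, name, message ] (for example with Text::CSV's getline, which copes with the embedded newlines). The name merge_by_phone, the extra @order array, and the shortened sample rows (including the third, unrelated row) are mine, purely for illustration.

```perl
use strict;
use warnings;

# Merge rows that share the same phone number (second field).
# Each row is an array ref: [ id, phone, name, message ].
# Returns the merged rows in the order their phone number was first seen.
sub merge_by_phone {
    my @rows = @_;
    my ( %by_phone, @order );
    for my $row (@rows) {
        my ( $id, $phone, $name, $msg ) = @$row;
        if ( exists $by_phone{$phone} ) {
            # Duplicate key: append this message to the stored record.
            $by_phone{$phone}{msg} .= " AND $msg";
        }
        else {
            # First time this key is seen: store it and remember its position.
            $by_phone{$phone} = { id => $id, name => $name, msg => $msg };
            push @order, $phone;
        }
    }
    return map {
        [ $by_phone{$_}{id}, $_, $by_phone{$_}{name}, $by_phone{$_}{msg} ]
    } @order;
}

# Hypothetical, shortened sample rows standing in for the parsed CSV records.
my @merged = merge_by_phone(
    [ 101, '5555555555', 'DOE, JOHN ', 'trip at 1:00 PM' ],
    [ 102, '5555555555', 'DOE, JOHN ', 'trip at 9:00 PM' ],
    [ 103, '5550000000', 'ROE, JANE ', 'trip at 3:00 PM' ],
);
print join( ' | ', @$_ ), "\n" for @merged;
```

The @order array is only there because a Perl hash does not remember insertion order; if the output order doesn't matter, it can be dropped and the hash values used directly.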
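Answer 1 (score: 0)
Something like this should work.
It isn't perfect, but it should give you a big head start. For example, you'll need to add some code to strip the extra name from the flattened description column.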
use Text::CSV;
my $data = parseFile($path);
# flatten_record returns the merged record, so collect its results
$data = [ map { flatten_record($_) } @$data ];
writeFile($newpath, $data);
sub csv_cols { qw/ id phone name desc / }
sub get_csv {
    my $csv = Text::CSV->new({
        sep_char => ',',
        always_quote => 1,
        quote_char => '"',
        escape_char => '"',
        binary => 1,
        auto_diag => 1
    });
    return $csv;
}
#Subroutine to parse csv file
sub parseFile {
    my ($file) = @_;
    die "CSV file not passed in subroutine\n"
        unless $file;
    my $csv = get_csv();
    #Open file for reading
    open(my $fh, '<', $file)
        or die "Could not open '$file' $!\n";
    $csv->column_names( csv_cols() );
    # make hash of arrays containing the rows for each phone number
    my %by_phone;
    for my $row ( @{$csv->getline_hr_all($fh)} ) {
        my $phone = $row->{phone};
        push @{$by_phone{$phone}}, $row;
    }
    close $fh;
    # note: hash order is unspecified, so the groups come back in arbitrary order
    return [ values %by_phone ];
}
sub flatten_record {
    my ($record) = @_;
    die "Empty record." if @$record == 0;
    # a single row needs no merging
    return $record->[0] if @$record == 1;
    # keep the first row's columns and join every description with " AND "
    return {
        id    => $record->[0]{id},
        phone => $record->[0]{phone},
        name  => $record->[0]{name},
        desc  => join( ' AND ', map { $_->{desc} } @$record ),
    };
}
sub writeFile {
    my ( $path, $data ) = @_;
    open my $fh, ">", $path
        or die "Error opening '$path' for writing- $!\n";
    my $csv = get_csv();
    $csv->eol("\n");    # end each written record with a newline
    for my $record ( @$data ) {
        my @row = @{$record}{ csv_cols() };
        $csv->print( $fh, \@row );
    }
    close $fh
        or die "Error closing '$path'- $!\n";
}
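As for the extra name in the flattened description column, one possible way to drop it is sketched below. It assumes every description starts with the record's name followed by a comma, as in the sample data; the helper name join_descs and the shortened sample strings are hypothetical.

```perl
use strict;
use warnings;

# Join descriptions with " AND ", removing the leading "NAME, " from every
# description after the first. $name is the record's name column.
# Assumes at least one description is passed in.
sub join_descs {
    my ( $name, @descs ) = @_;
    ( my $clean = $name ) =~ s/^\s+|\s+$//g;    # trim whitespace around the name
    my ( $first, @rest ) = @descs;              # @rest holds copies, safe to edit
    s/^\s*\Q$clean\E\s*,\s*// for @rest;        # strip "DOE, JOHN, " style prefix
    return join ' AND ', $first, @rest;
}

# Hypothetical, shortened sample values based on the question's data.
my $merged = join_descs(
    'DOE, JOHN ',
    ' DOE, JOHN, your trip at 1:00 PM',
    ' DOE, JOHN, your trip at 9:00 PM',
);
print "$merged\n";
```

The \Q...\E quoting keeps any regex metacharacters in the name from being interpreted. If a description does not start with the name, the substitution simply leaves it unchanged.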