I have an input string:
ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
If I use the split function, I get strange output.
my ($field1, $field2, $field3, $field4) = "";
while (<DATAFILE>) {
    $row = $_;
    $row =~ s/\r?\n$//;
    ($field1, $field2, $field3, $field4) = split(/,/, $row);
}
The output I get is:
field1 :: ACC000121
field2 :: 2290
field3 :: "01009900
field4 :: 01009901
Expected output:
field1 = ACC000121
field2 = 2290
field3 = 01009900,01009901,01009902,01009903,01009904
field4 = 4
field5 = 5
field6 = 6
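I am weak in Perl. Please help me.
Answer 0 (score: 4)
If you have CSV data, you really want to parse it with Text::CSV. As you have discovered, parsing CSV data is usually not as simple as splitting on commas, and Text::CSV handles the edge cases (such as commas inside quoted fields) for you.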
use strict;
use warnings;
use Data::Dump;
use Text::CSV;
my $csv = Text::CSV->new;
while (<DATA>) {
    $csv->parse($_);
    my @fields = $csv->fields;
    dd(\@fields);
}
__DATA__
ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
Output:
[
  "ACC000121",
  2290,
  "01009900,01009901,01009902,01009903,01009904",
  4,
  5,
  6,
]
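Answer 1 (score: 0)
I agree with Matt Jacob's answer: you should parse CSV with Text::CSV unless you have a good reason not to.
If you are going to process it with a regex anyway, I think m// does a better job than split. For example, the code below seems to cover most single-line CSV data variants, but unlike Text::CSV it does not remove the quotes around quoted fields; that requires a separate post-processing step.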
use strict;
use warnings;
sub splitter
{
    my($row) = @_;
    my @fields;
    my $i = 0;
    while ($row =~ m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/g)
    {
        print "Found [$1]\n";
        $fields[$i++] = $1;
    }
    for (my $j = 0; $j < @fields; $j++)
    {
        print "$j = [$fields[$j]]\n";
    }
}
my $row;
$row = q'ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6';
print "Row 1: $row\n";
splitter($row);
$row = q'ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""';
print "Row 2: $row\n";
splitter($row);
Obviously, there is a fair amount of diagnostic code in there. The output (from Perl 5.22.0 on Mac OS X 10.11.1) is:
Row 1: ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
Found [ACC000121]
Found [2290]
Found ["01009900,01009901,01009902,01009903,01009904"]
Found [4]
Found [5]
Found [6]
0 = [ACC000121]
1 = [2290]
2 = ["01009900,01009901,01009902,01009903,01009904"]
3 = [4]
4 = [5]
5 = [6]
Row 2: ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""
Found [ACC000121]
Found [","]
Found [2290]
Found ["01009900,""aux data"",01009902,01009903,01009904"]
Found []
Found [5"abc"]
Found [6]
Found [""]
0 = [ACC000121]
1 = [","]
2 = [2290]
3 = ["01009900,""aux data"",01009902,01009903,01009904"]
4 = []
5 = [5"abc"]
6 = [6]
7 = [""]
In the Perl code, the match is:
m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/
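This looks for and captures (in $1) one of three things: an empty field followed by a comma; or a character that is neither a double quote nor a comma, followed by zero or more non-commas; or a double quote, followed by zero or more occurrences of "a character that is not a double quote, or two consecutive double quotes", followed by a closing double quote. After the captured field it requires either a comma or the end of the string.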
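The three alternatives are easier to follow when the pattern is spread out with the /x modifier. The following is only a sketch of the same match: the names $field_re and split_fields are made up for illustration, and the inner group is made non-capturing (an assumption beyond the original) so that a list-context match returns exactly one value per field.

```perl
use strict;
use warnings;

# Same alternatives as the one-line pattern, laid out with the x modifier.
# $field_re and split_fields are illustrative names, not part of the answer.
my $field_re = qr{
    (                         # capture group 1: the field text
        (?=,)                 #   empty field: nothing before the next comma
      | [^",] [^,]*           #   unquoted field: no leading quote, runs to the next comma
      | " (?: [^"] | "" )* "  #   quoted field: may hold commas and doubled quotes
    )
    (?: , | $ )               # then a separator comma or the end of the string
}x;

# Return the raw fields of one CSV line; quoted fields keep their quotes.
sub split_fields {
    my ($row) = @_;
    my @fields = $row =~ /$field_re/g;
    return @fields;
}

my @fields = split_fields(q{ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6});
print scalar(@fields), " fields\n";
print "field3 = $fields[2]\n";
```

Like the original pattern, this sketch silently drops an empty field at the very end of a line, because nothing is left to match after the final comma.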
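Handling multi-line fields requires more work. Removing the surrounding quotes and the escaped double quotes also requires more work.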
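As a rough idea of what that post-processing step could look like for single-line fields, here is a minimal sketch. The helper name unquote_field is hypothetical, and it assumes a field counts as quoted only when it both starts and ends with a double quote; anything else (such as 5"abc") is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: undo CSV quoting on one field captured by the regex.
sub unquote_field {
    my ($field) = @_;
    # Only fields wrapped in a pair of double quotes need any work.
    if ($field =~ /\A"(.*)"\z/s) {
        $field = $1;
        $field =~ s/""/"/g;    # a doubled quote stands for one literal quote
    }
    return $field;
}

# Raw fields in the form the regex match returns them.
my @raw   = ('ACC000121', '","', '"01009900,""aux data"",01009902"', '5"abc"', '""');
my @clean = map { unquote_field($_) } @raw;
print "[$_]\n" for @clean;
```

Note that after this step an empty quoted field ("") and a truly empty field both become the empty string, so the distinction is lost.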
Using Text::CSV is simpler and much less error-prone (and it can handle more variants than this).