Question

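I'm parsing a CSV file with embedded commas, and obviously using split() runs into some limitations because of that.

One thing I should note is that the values with embedded commas are surrounded by parentheses, double quotes, or both...

For example:

(Date, Notional), "Date, Notional", "(Date, Notional)"

Also, I'm trying to do this without using any modules, for reasons I don't want to go into right now...

Can anyone help me out with this?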

Answer 1

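This should do what you need. It works in much the same way as the code in Text::CSV_PP, but it doesn't allow escaped characters within a field, since you say you have none.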

use strict;
use warnings;
use 5.010;

my $re = qr/(?| "\( ( [^()""]* ) \)" |  \( ( [^()]* ) \) |  " ( [^"]* ) " |  ( [^,]* ) ) , \s* /x;

my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';

my @fields = "$line," =~ /$re/g;

say "<$_>" for @fields;

Output

<Date, Notional 1>
<Date, Notional 2>
<Date, Notional 3>

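Update

Here is a version for older Perls (before version 5.10), which don't have the regex branch reset construct (?|...). Each alternative now has its own capture group, so the groups that did not take part in a match come back as undef and are filtered out with grep defined. It produces the same output as the code above.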

use strict;
use warnings;

my $re = qr/(?: "\( ( [^()""]* ) \)" |  \( ( [^()]* ) \) |  " ( [^"]* ) " |  ( [^,]* ) ) , \s* /x;

my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';

# Only one of the four capture groups is set per match; drop the undef ones
my @fields = grep defined, "$line," =~ /$re/g;

# print instead of say, so the script does not need Perl 5.10
print "<$_>\n" for @fields;
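
To see why the second version needs grep defined, it may help to run both groupings against a single field. The following is only a small sketch with a shortened pattern (three alternatives instead of four) and a made-up one-field input; the branch reset half still requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# A single quoted field, already followed by the trailing comma.
my $field = '"Date, Notional 2",';

# Branch reset: every alternative writes to capture group 1.
my @reset = $field =~ /(?| \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;

# Plain grouping: the alternatives own groups 1, 2 and 3 respectively,
# so the two groups that did not match are returned as undef.
my @plain = $field =~ /(?: \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;

print scalar(@reset), " value(s) with branch reset\n";
print scalar(@plain), " value(s) with plain grouping\n";
print "defined: <$_>\n" for grep defined, @plain;
```

With the plain grouping only the second group is set for this input, which is why the undefined entries have to be filtered before printing.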

Answer 2

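I know you already have Borodin's answer, but for the record, there is also a simple solution (see the results at the bottom of the online demo). This situation sounds very similar to "regex match a pattern unless...".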

#!/usr/bin/perl
$regex = '(?:\([^\)]*\)|"[^"]*")(*SKIP)(*F)|\s*,\s*';
$subject = '(Date, Notional), "Date, Notional", "(Date, Notional)"';
@splits = split($regex, $subject);
print "\n*** Splits ***\n";
foreach(@splits) { print "$_\n"; }

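How it works

The left side of the alternation | matches complete (parenthesized) and "quoted" strings, then deliberately fails. The right side matches commas, and we know they are the right commas because they were not matched by the expression on the left.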
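
Note that split keeps the surrounding parentheses and quotes in each piece. If the bare values are wanted, one possible follow-up, sketched here under the assumption that each field carries at most one layer of quotes around at most one layer of parentheses, is to strip them after splitting:

```perl
use strict;
use warnings;

# Quoted or parenthesized runs are matched first and thrown away with
# (*SKIP)(*F); only the commas left over act as separators.
my $separator = qr/(?:\([^)]*\)|"[^"]*")(*SKIP)(*F)|\s*,\s*/;

my $subject = '(Date, Notional), "Date, Notional", "(Date, Notional)"';
my @splits  = split $separator, $subject;

# Assumption: strip one layer of quotes, then one layer of parentheses.
# $piece is an alias, so the substitutions modify @splits in place.
for my $piece (@splits) {
    $piece =~ s/^"(.*)"$/$1/;
    $piece =~ s/^\((.*)\)$/$1/;
}

print "<$_>\n" for @splits;
```

The backtracking control verbs (*SKIP) and (*F) need Perl 5.10 or later.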

Possible refinement

If needed, the parenthesis-matching part can be made recursive so that it matches (nested(parens)).

Reference
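
A rough sketch of that recursive variant, assuming Perl 5.10 or later: the group name parens and the sample input are made up for illustration. Because a named group is still a capture group, split returns an extra undef entry for every separator, so those are filtered out with grep defined.

```perl
use strict;
use warnings;

my $separator = qr/
    # Define a balanced-parentheses subpattern that calls itself
    (?(DEFINE) (?<parens> \( (?: [^()]++ | (?&parens) )* \) ) )
    # Match a balanced or quoted run, then skip past it and fail
    (?: (?&parens) | "[^"]*" ) (*SKIP)(*F)
  |
    # Whatever commas remain are real separators
    \s* , \s*
/x;

my $subject = '(Date, (nested, Notional)), "Date, Notional", plain';

# grep defined drops the undef entries produced by the capture group
my @splits = grep defined, split $separator, $subject;

print "<$_>\n" for @splits;
```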

How to match (or replace) a pattern except in situations s1, s2, s3...

Perl: parsing a CSV file with embedded commas
