Using a regex to modify specific columns in a CSV

Asked: 2012-03-11 18:41:45

Tags: regex perl r csv

I'd like to convert some strings in a CSV from 0000-2400 hour format (HHMM) to 00-24 hour format (HH), e.g.

2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00
2011-01-02,"AA",12478,31703,12892,32575,"0908",-2.00,"1236",1.00,2475.00
2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00

Columns 7 and 9 are the departure and arrival times, respectively. When I'm done, the rows should ideally look like this:

2011-01-01,"AA",12478,31703,12892,32575,"09",-4.00,"12",-26.00,2475.00

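The whole CSV will eventually be imported into R, and I'd like to do some of the processing beforehand because the file will be fairly large. I initially tried doing this in Perl, but I'm having trouble picking out multiple digits with a regex. I can get a single digit before a given comma with a lookbehind expression, but not more than one.

I'm also quite open to being told that doing this in Perl is needlessly silly and that I should stick to R. :)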

2 Answers:

Answer 0: (score: 3)

I might as well offer my own solution, which is

s/"(\d\d)\d\d"/"$1"/g
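As a quick sanity check, here is a minimal sketch that applies that substitution to one of the sample rows from the question. The sub name `shorten_times` is made up for illustration. Note that the pattern rewrites every quoted four-digit field on the line, not only columns 7 and 9, which happens to be fine for this data but is an assumption about the rest of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: trims every quoted four-digit field ("HHMM") to "HH".
# It is not column-aware; any other quoted four-digit field would be trimmed too.
sub shorten_times {
    my ($line) = @_;
    $line =~ s/"(\d\d)\d\d"/"$1"/g;
    return $line;
}

my $row = '2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00';
print shorten_times($row), "\n";
# prints: 2011-01-01,"AA",12478,31703,12892,32575,"09",-4.00,"12",-26.00,2475.00
```

For a whole file, the same substitution should work as a one-liner along the lines of `perl -pe 's/"(\d\d)\d\d"/"$1"/g' input.csv > output.csv`, where both file names are placeholders.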

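Answer 1: (score: 2)

As I mentioned in the comments, using a CSV module such as Text::CSV is a safe choice. Here's a quick sample script showing how to use it. You'll notice that it doesn't preserve the quotes, although I expected it to, since I set keep_meta_info. If that matters to you, I'm sure there's a way around it.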

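use strict;
use warnings;

use Text::CSV;

# binary => 1 allows arbitrary bytes in fields;
# eol => $/ ends each printed row with a newline.
my $csv = Text::CSV->new({
        binary => 1,
        eol => $/,
        keep_meta_info => 1,
});
while (my $row = $csv->getline(*DATA)) {
    # Columns 7 and 9 (indices 6 and 8) hold the HHMM times.
    for ($row->[6], $row->[8]) {
        # \K keeps the first two digits (HH); the two minute digits are deleted.
        s/\d\d\K\d\d//;
    }
    $csv->print(*STDOUT, $row);
}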

__DATA__
2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00
2011-01-02,"AA",12478,31703,12892,32575,"0908",-2.00,"1236",1.00,2475.00
2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00

Output:

2011-01-01,AA,12478,31703,12892,32575,09,-4.00,12,-26.00,2475.00
2011-01-02,AA,12478,31703,12892,32575,09,-2.00,12,1.00,2475.00
2011-01-03,AA,12478,31703,12892,32575,09,-3.00,12,4.00,2475.00
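If keeping the original quotes matters, one possible workaround is to skip the CSV module and touch only the two time columns by hand. The sketch below is not part of the answer above and rests on a loud assumption: no field ever contains an embedded comma, so a plain `split` on commas is safe. The sub name `shorten_time_columns` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: trims "HHMM" to "HH" in columns 7 and 9 only,
# leaving every other field (and all quoting) exactly as it was.
# ASSUMPTION: fields never contain embedded commas or quoted newlines.
sub shorten_time_columns {
    my ($line) = @_;
    my @fields = split /,/, $line, -1;   # -1 keeps trailing empty fields
    for my $i (6, 8) {                   # zero-based indices of columns 7 and 9
        next unless defined $fields[$i];
        $fields[$i] =~ s/^"(\d\d)\d\d"$/"$1"/;
    }
    return join ',', @fields;
}

my $row = '2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00';
print shorten_time_columns($row), "\n";
# prints: 2011-01-03,"AA",12478,31703,12892,32575,"09",-3.00,"12",4.00,2475.00
```

Unlike a line-wide substitution, this leaves a quoted four-digit value in any other column alone. For data that may contain quoted commas, a real CSV parser such as Text::CSV remains the safer route.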