Parsing a CSV file with embedded newlines and extra commas

Time: 2015-09-23 20:06:48

Tags: parsing csv

My current CSV file looks like this:

field1, field1, field3, field4, field5, field6  
111, John, Doctor, 1A-jrd, ,"Tuft St  
Peoria, IL 54345  
(12.11111, 43.5555)"  
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave  
Boston, MA 23123  
(67.11111,- 49.5567)"  
131, Kyle, Engineer, 3A-bhbh, , "Barnes St  
San Francisco, CA 34654  
(65.11111, 55.985432)"  

In some cases field5 has no value. In addition, field6 is enclosed in quotes and contains newlines. For example, field6 of the first data row is actually

"Tuft St  
Peoria, IL 54345  
(12.11111, 43.5575)"  

I need to write a script that parses this file and returns 12.111, 43.557 in place of the current value of field6, so that the final CSV file looks like

field1, field1, field3, field4, field5, field6  
111, John, Doctor, 1A-jrd, , "12.111, 43.555"  
121, Bob, Teacher, 2A-abcd, 345, "67.111,- 49.556"  
131, Kyle, Engineer, 3A-bhbh, , "65.111, 55.985"  

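I have looked at cvsparser, but my understanding is that it only works when each data row sits on a single line with no embedded newlines. I also cannot simply split each line on commas, because some addresses contain several commas. Any suggestions on how to parse this CSV file?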

4 Answers:

Answer 0: (score: 1)

For this kind of "unstructured CSV" format you can use Marpa::R2, a Perl interface to Marpa, a general BNF parser.

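The data can be described in BNF as this ::= that rules (the ~ operator defines lexical rules). Parentheses in ::= rules, e.g. (header [\n]), mean "do not include this in the parse result."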

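The parser returns a data structure (an array of arrays in the form [ id, child1, child2 ... ]) from which the data can be extracted.

You can also define semantic actions as Perl subs, in the same package or a separate one, to process the data.

An example script and its output (based on your data) are shown below.

Script: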

use 5.010;
use strict;
use warnings;

use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Deepcopy = 1;

use Marpa::R2;

my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),

    :default ::= action => [ name, value]
    lexeme default = action => [ name, value] latm => 1

    csv ::= (header [\n]) lines

    header ::= column+ separator => column_sep
    column_sep ~ ', '
    column ~ 'field' [1-6]

    lines       ::= line+ separator => [\n]
    line        ::= fields1_5 (',') field6
    field_sep   ~ ','
    fields1_5   ::= field1_5+ separator => field_sep
    field1_5    ~ num | word | code
    field6      ~ address

    num ~ [\d]+
    word ~ [A-Za-z]+
    code  ~ num word '-' word
    address ~ '"' address_chars '"'
    address_chars ~ [^\"]+ #"

    :discard ~ space
    space ~ ' '

END_OF_SOURCE
} );

my $input = <<EOI;
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
EOI

say Dumper ${ $g->parse( \$input, { trace_terminals => 0 } ) };

Output:

[
  'csv',
  [
    'lines',
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '111'
        ],
        [
          'field1_5',
          'John'
        ],
        [
          'field1_5',
          'Doctor'
        ],
        [
          'field1_5',
          '1A-jrd'
        ]
      ],
      [
        'field6',
        '"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"'
      ]
    ],
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '121'
        ],
        [
          'field1_5',
          'Bob'
        ],
        [
          'field1_5',
          'Teacher'
        ],
        [
          'field1_5',
          '2A-abcd'
        ],
        [
          'field1_5',
          '345'
        ]
      ],
      [
        'field6',
        '"Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"'
      ]
    ],
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '131'
        ],
        [
          'field1_5',
          'Kyle'
        ],
        [
          'field1_5',
          'Engineer'
        ],
        [
          'field1_5',
          '3A-bhbh'
        ]
      ],
      [
        'field6',
        '"Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"'
      ]
    ]
  ]
]
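
The nested array above can be walked with a few lines of code to pull out each record. The sketch below is in Python rather than Perl and works on a hand-copied fragment of the dump, so the `tree` literal and the `records` helper are illustrative assumptions and not part of Marpa::R2.

```python
# Hand-copied fragment of the dump above: [name, child1, child2, ...]
tree = ["csv",
    ["lines",
        ["line",
            ["fields1_5",
                ["field1_5", "111"], ["field1_5", "John"],
                ["field1_5", "Doctor"], ["field1_5", "1A-jrd"]],
            ["field6", '"Tuft St\nPeoria, IL 54345\n(12.11111, 43.5555)"']],
    ],
]

def records(node):
    """Yield (fields1_5 values, field6 text) for every 'line' node (hypothetical helper)."""
    name, *children = node
    if name == "line":
        fields, field6 = children
        # Leaves look like ['field1_5', value]; drop the surrounding quotes of field6.
        yield [leaf[1] for leaf in fields[1:]], field6[1].strip('"')
    else:
        for child in children:
            if isinstance(child, list):
                yield from records(child)

for fields, address in records(tree):
    # The coordinate pair is the last physical line of the address.
    print(fields, address.splitlines()[-1])
```

Note that in the dump an empty field5 does not appear as a node, so rows without a field5 value carry only four entries under fields1_5; any code consuming the tree has to allow for that.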

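Answer 1: (score: 0)

You can't. When commas are allowed in field 6, this is a valid file:

a,b,c,d,,e
a,b,c,d,e,f

You cannot tell whether this file contains one entry or two, because field 6 of the first record could be either 'e' or 'e\na,b,c,d,e,f'.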
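
The ambiguity described here concerns unquoted fields. As a small sketch using Python's standard csv module (the two sample strings are made up for illustration), quoting field 6 is what lets a parser tell an embedded newline from a record separator:

```python
import csv
import io

unquoted = "a,b,c,d,,e\na,b,c,d,e,f\n"
quoted = 'a,b,c,d,,"e\na,b,c,d,e,f"\n'

# Unquoted: the newline can only be read as a record separator, giving two records.
rows_unquoted = list(csv.reader(io.StringIO(unquoted, newline="")))
# Quoted: the newline and the commas stay inside field 6, giving one record.
rows_quoted = list(csv.reader(io.StringIO(quoted, newline="")))

print(len(rows_unquoted), rows_unquoted)
print(len(rows_quoted), rows_quoted)
```

In the question's data field6 is always quoted, so the second case is the one that applies there.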

Answer 2: (score: 0)

You can use the csv library for this:

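import csv

# newline='' lets the csv module handle newlines embedded in quoted fields
with open('myfile.csv', newline='') as myfile:
    # skipinitialspace=True so that `, "Moore Ave...` is read as a quoted field
    csv_file = csv.reader(myfile, delimiter=',', skipinitialspace=True)
    rows = list(csv_file)  # read everything before the file is closed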

Now you have the rows; do whatever you want with them.
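
As a rough check of this approach against the question's data, the sketch below feeds two of the sample records to csv.reader from an in-memory string (so it runs without a file). It assumes `skipinitialspace=True` is needed, because the sample puts a space between the comma and the opening quote.

```python
import csv
import io

sample = '''field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
'''

# newline="" keeps the embedded line breaks exactly as they appear in the data.
reader = csv.reader(io.StringIO(sample, newline=""), skipinitialspace=True)
header = next(reader)
rows = list(reader)

for row in rows:
    # Each logical record has six fields; field6 spans three physical lines.
    print(len(row), repr(row[5]))
```

Each record comes back as one row of six fields, with the empty field5 as an empty string and the whole address, newlines included, in the last field.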

Answer 3: (score: 0)

You need a CSV parser for this kind of data. I suggest Perl and Text::CSV.

Something like this:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV; 

my $csv = Text::CSV -> new( { 'binary' => 1, eol => "\n" } ); 

open ( my $input_fh, '<', "sample.csv" ) or die $!; 

my $header = $csv -> getline ( $input_fh );
$csv -> print ( \*STDOUT, $header );

while ( my $row = $csv -> getline ( $input_fh ) ) { 
    $row -> [5] =~ s,.*\(,\(,ms;
    $csv -> print ( \*STDOUT, $row );
}

Given the source data:

field1, field1, field3, field4, field5, field6  
111, John, Doctor, 1A-jrd, ,"Tuft St  
Peoria, IL 54345  
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd,345,"Moore Ave  
Boston, MA 23123  
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, ,"Barnes St  
San Francisco, CA 34654  
(65.11111, 55.985432)"

Output:

field1," field1"," field3"," field4"," field5"," field6  "
111," John"," Doctor"," 1A-jrd"," ","(12.11111, 43.5555)"
121," Bob"," Teacher"," 2A-abcd",345,"(67.11111,- 49.5567)"
131," Kyle"," Engineer"," 3A-bhbh"," ","(65.11111, 55.985432)"

Hopefully it is clear how to modify 'field6' further to meet your specification exactly.
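
One possible way to finish that last step is sketched below, in Python rather than Perl. The `coords` helper is hypothetical, and it assumes the question wants the values truncated (not rounded) to three decimals, since the expected output turns 49.5567 into 49.556.

```python
import re

def coords(field6, places=3):
    """Keep only the trailing '(lat, lon)' pair, truncated to `places` decimals."""
    m = re.search(r"\(([^()]*)\)\s*$", field6)
    if m is None:
        return field6  # no coordinate pair found: leave the value untouched
    # Cut every decimal number down to at most `places` fractional digits,
    # leaving signs and spacing between the numbers as they were.
    return re.sub(r"(\d+\.\d{%d})\d*" % places, r"\1", m.group(1))

print(coords("Tuft St\nPeoria, IL 54345\n(12.11111, 43.5555)"))
print(coords("Moore Ave\nBoston, MA 23123\n(67.11111,- 49.5567)"))
print(coords("Barnes St\nSan Francisco, CA 34654\n(65.11111, 55.985432)"))
```

Because only the digits are trimmed, the odd "- 49" spacing of the second record is carried through unchanged, as it is in the question's desired output.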