My current csv file looks like this:
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
In some cases field5 has no value. Also, field6 is enclosed in quotes and contains newlines. For example, field6 of the first data row is actually
"Tuft St
Peoria, IL 54345
(12.11111, 43.5575)"
I need to write a script that parses this file and returns 12.111, 43.557 in place of the current value of field6, so that the final csv file looks like
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, , "12.111, 43.555"
121, Bob, Teacher, 2A-abcd, 345, "67.111,- 49.556"
131, Kyle, Engineer, 3A-bhbh, , "65.111, 55.985"
I have looked at csvparser, but my understanding is that it only works when an entire data row sits on a single line without any newlines. Also, I can't simply split each line on commas, because some of the addresses contain multiple commas. Any suggestions on how to parse this csv file?
Answer 0 (score: 1)
For this kind of "unstructured csv" you can use Marpa::R2, a Perl interface to Marpa, a general BNF parser.

The data can be described in BNF as this ::= that (the ~ operator defines lexical rules). Parens in ::= rules, e.g. (header [\n]), mean "do not include this in the parse result".

The parser returns a data structure (an array of arrays in [ id, child1, child2 ... ] format) from which the data can be extracted. You can also define semantic actions as Perl subs, in the same or a separate package, to process the data.

A sample script and its output (based on your data) are shown below.

Script:
use 5.010;
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Deepcopy = 1;
use Marpa::R2;
my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),
    :default ::= action => [ name, value]
    lexeme default = action => [ name, value] latm => 1

    csv ::= (header [\n]) lines

    header ::= column+ separator => column_sep
    column_sep ~ ', '
    column ~ 'field' [1-6]

    lines ::= line+ separator => [\n]
    line ::= fields1_5 (',') field6
    field_sep ~ ','

    fields1_5 ::= field1_5+ separator => field_sep
    field1_5 ~ num | word | code
    field6 ~ address

    num ~ [\d]+
    word ~ [A-Za-z]+
    code ~ num word '-' word

    address ~ '"' address_chars '"'
    address_chars ~ [^\"]+ #"

    :discard ~ space
    space ~ ' '
END_OF_SOURCE
} );
my $input = <<EOI;
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
EOI
say Dumper ${ $g->parse( \$input, { trace_terminals => 0 } ) };
Output:
[
  'csv',
  [
    'lines',
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '111'
        ],
        [
          'field1_5',
          'John'
        ],
        [
          'field1_5',
          'Doctor'
        ],
        [
          'field1_5',
          '1A-jrd'
        ]
      ],
      [
        'field6',
        '"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"'
      ]
    ],
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '121'
        ],
        [
          'field1_5',
          'Bob'
        ],
        [
          'field1_5',
          'Teacher'
        ],
        [
          'field1_5',
          '2A-abcd'
        ],
        [
          'field1_5',
          '345'
        ]
      ],
      [
        'field6',
        '"Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"'
      ]
    ],
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '131'
        ],
        [
          'field1_5',
          'Kyle'
        ],
        [
          'field1_5',
          'Engineer'
        ],
        [
          'field1_5',
          '3A-bhbh'
        ]
      ],
      [
        'field6',
        '"Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"'
      ]
    ]
  ]
]
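One way to turn that tree into the flattened csv you asked for (a sketch, not part of the original answer: it assumes the tree shape shown above and truncates each coordinate to three decimals) is to append something like this to the script:

my $tree  = ${ $g->parse( \$input ) };
my $lines = $tree->[1];    # [ 'lines', line1, line2, ... ]
for my $line ( @{$lines}[ 1 .. $#{$lines} ] ) {
    my ( undef, $fields1_5, $field6 ) = @{$line};
    # collect the plain values of fields 1..5
    my @values = map { $_->[1] } @{$fields1_5}[ 1 .. $#{$fields1_5} ];
    # keep only the "(lat, long)" tail of field6, then truncate to three decimals
    ( my $coords = $field6->[1] ) =~ s{.*\(([^)]*)\).*}{$1}s;
    $coords =~ s{(\.\d{3})\d+}{$1}g;
    say join( ', ', @values ) . qq{, "$coords"};
}

Note that rows whose field5 is empty come out one column short here, because the grammar above simply skips the empty field; re-inserting the blank column (and printing the header) is left out of the sketch.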
Answer 1 (score: 0)
You can't. With commas (and newlines) allowed in field6, this is a valid file:
a,b,c,d,,e
a,b,c,d,e,f
You have no way to know whether this file contains one entry or two, because field6 of the first data set could be either 'e' or 'e\na,b,c,d,e,f'.
Answer 2 (score: 0)
You can use the csv library for this:
import csv

with open('myfile.csv') as myfile:
    # csv.reader handles quoted fields that span multiple lines
    csv_file = csv.reader(myfile, delimiter=',')
Now you have the rows; do whatever you want with them.
Answer 3 (score: 0)
You need a CSV parser that can handle that kind of data. I would suggest perl and Text::CSV.

Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV -> new( { 'binary' => 1, eol => "\n" } );
open ( my $input_fh, '<', "sample.csv" ) or die $!;
my $header = $csv -> getline ( $input_fh );
$csv -> print ( \*STDOUT, $header );
while ( my $row = $csv -> getline ( $input_fh ) ) {
    # strip everything in field6 before the coordinates, leaving "(lat, long)"
    $row -> [5] =~ s,.*\(,\(,ms;
    $csv -> print ( \*STDOUT, $row );
}
Given the source data:
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd,345,"Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, ,"Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
Output:
field1," field1"," field3"," field4"," field5"," field6 "
111," John"," Doctor"," 1A-jrd"," ","(12.11111, 43.5555)"
121," Bob"," Teacher"," 2A-abcd",345,"(67.11111,- 49.5567)"
131," Kyle"," Engineer"," 3A-bhbh"," ","(65.11111, 55.985432)"
Hopefully it is clear how you could modify 'field6' further to match your spec exactly.
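As a rough sketch (not part of the original answer), assuming the truncation to three decimals shown in the question, the loop body could be extended like this:

while ( my $row = $csv -> getline ( $input_fh ) ) {
    # keep only what is inside the "(lat, long)" parentheses of field6,
    # then truncate each coordinate to three decimal places
    $row -> [5] =~ s{.*\(([^)]*)\).*}{$1}ms;
    $row -> [5] =~ s{(\.\d{3})\d+}{$1}g;
    $csv -> print ( \*STDOUT, $row );
}

which should print field6 as, for example, "12.111, 43.555" for the first row.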