I have a lot of data dumps - a pretty huge amount of data - structured as follows:
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
This I would like to transform into something like:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
In other words, one column per key and one row per record. But I am not sure whether this format is unusual, or whether there is a tool that already does this.
Answer 0 (score: 2)
This is very easy using a hash and the Text::CSV_XS module:
use strict;
use warnings;
use Text::CSV_XS;

my @rows;
my %headers;

{
    local $/ = "";                      # paragraph mode: records are separated by blank lines
    while (<DATA>) {
        chomp;
        my %record;
        for my $line ( split /\n/ ) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1}  = $2;          # key => value for this record
            $headers{$1} = $1;          # remember every key we have seen
        }
        push( @rows, \%record );
    }
}

unshift( @rows, \%headers );            # the key names themselves form the header row

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );
$csv->column_names( sort( keys(%headers) ) );

for my $row_ref (@rows) {
    $csv->print_hr( *STDOUT, $row_ref );
}
__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Output:
Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
Answer 1 (score: 0)
If your CSV format is "complicated" - e.g. it contains commas and the like - then use one of the Text::CSV modules. But if it is not - and that is often the case - I tend to just use split and join.
What is useful in your scenario is that you can easily map the key-value pairs of each record with a regex, and then output them with a hash slice:
#!/usr/bin/env perl
use strict;
use warnings;

#set paragraph mode - records are blank line separated.
local $/ = "";

my @rows;
my %seen_header;

#read STDIN or files on command line, just like sed/grep
while (<>) {
    #multi-line pattern that matches all the key-value pairs,
    #and inserts them into a hash.
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    push( @rows, \%this_row );

    #add the keys we've seen to a hash, so we 'know' what we've seen.
    $seen_header{$_}++ for keys %this_row;
}

#extract the keys, make them unique and ordered.
#could set this by hand if you prefer.
my @header = sort keys %seen_header;

#print the header row
print join( ",", @header ), "\n";

#iterate the rows
foreach my $row (@rows) {
    #use a hash slice to select the values matching @header.
    #the map is so any undefined values (missing keys) don't warn,
    #they just become blank fields.
    print join( ",", map { $_ // '' } @{$row}{@header} ), "\n";
}
Given your sample input, this produces:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
If you want to be really clever, most of that initial building loop can be done with:
my @rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;
The problem is that you also need to build the 'header' array, which means something a little more involved:
$seen_header{$_}++ for map { keys %$_ } @rows;
It works, but I don't think it is as clear about what is going on.
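Put together, a sketch of that 'clever' version as a complete script (with the same assumptions about the input format) might be:

#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "";    # paragraph mode

#build every record in one pass; the leading + forces an anonymous
#hash rather than a bare block inside map.
my @rows = map { +{ m/^(\w+):\.+ (.*)$/gm } } <>;

#collect every key seen in any record
my %seen_header;
$seen_header{$_}++ for map { keys %$_ } @rows;
my @header = sort keys %seen_header;

print join( ",", @header ), "\n";
foreach my $row (@rows) {
    print join( ",", map { $_ // '' } @{$row}{@header} ), "\n";
}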
However, the core of your problem is probably the file size - that is where you will run into trouble, because you would need to read the file twice: once to find out which headers exist across the whole file, and a second time to iterate and print:
#!/usr/bin/env perl
use strict;
use warnings;

open( my $input, '<', 'your_file.txt' ) or die $!;
local $/ = "";

#first pass - just collect the headers.
my %seen_header;
while (<$input>) {
    $seen_header{$_}++ for m/^(\w+):/gm;
}
my @header = sort keys %seen_header;

#print the header row
print join( ",", @header ), "\n";

#return to the start of file:
seek( $input, 0, 0 );

#second pass - print each record.
while (<$input>) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join( ",", map { $_ // '' } @this_row{@header} ), "\n";
}
This will be slightly slower, because it has to read the file twice. But it will not use anywhere near as much memory, because it does not hold the whole file in memory.
Unless you know all the keys in advance and can define them yourself, you will have to read the file twice.
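For completeness: if you do know every key up front, a single pass is enough. A rough sketch, with the key list hard-coded from your sample (adjust it to your real data):

#!/usr/bin/env perl
use strict;
use warnings;

#the full key list, known in advance (taken from the sample data)
my @header = qw( Key1 Key2 Key3 Key5 );

local $/ = "";    # paragraph mode

print join( ",", @header ), "\n";
while (<>) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join( ",", map { $_ // '' } @this_row{@header} ), "\n";
}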
Answer 2 (score: -1)
This seems to work with the data that you have provided:
use strict;
use warnings 'all';

my %data;

while (<>) {
    #capture the key before the colon, then skip the run of dots and
    #spaces, then capture the value (trimmed of trailing whitespace)
    next unless /^(\w+):\W*(.*\S)/;
    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;
The Data::Dump output is:

{
  Key1 => ["Value", "Different value"],
  Key2 => ["Other value"],
  Key3 => ["Maybe another value yet", "Invaluable"],
  Key5 => ["Has no value at all"],
}
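If you then want CSV out of that structure, one possible sketch is below - but note that it only pads the shorter columns with blanks, it does not remember which record each value came from, so values can end up on a different row than in the original dump:

use strict;
use warnings 'all';
use List::Util 'max';

#the structure produced above (hard-coded here so the sketch is self-contained)
my %data = (
    Key1 => [ "Value", "Different value" ],
    Key2 => [ "Other value" ],
    Key3 => [ "Maybe another value yet", "Invaluable" ],
    Key5 => [ "Has no value at all" ],
);

my @header = sort keys %data;
my $rows   = max map { scalar @{ $data{$_} } } @header;

print join( ",", @header ), "\n";
for my $i ( 0 .. $rows - 1 ) {
    print join( ",", map { $data{$_}[$i] // '' } @header ), "\n";
}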