I have a lot of data dumps - a pretty huge amount of data - structured as follows:
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
This I would like to transform into something like:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
In other words, one column per key and one row per record. But I am not sure whether this format is unusual, or whether there is a tool that already does this.
Answer 0 (score: 2)
This is very easy using a hash and the Text::CSV_XS module:
use strict;
use warnings;
use Text::CSV_XS;

my @rows;
my %headers;

{
    local $/ = "";                      # paragraph mode: records are separated by blank lines
    while (<DATA>) {
        chomp;
        my %record;
        for my $line ( split /\n/ ) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1}  = $2;          # key => value for this record
            $headers{$1} = $1;          # remember every key we have seen
        }
        push( @rows, \%record );
    }
}

unshift( @rows, \%headers );            # the key names themselves form the header row

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );
$csv->column_names( sort( keys(%headers) ) );

for my $row_ref (@rows) {
    $csv->print_hr( *STDOUT, $row_ref );
}
__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Output:
Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
Answer 1 (score: 0)
If your CSV format is "complicated" - e.g. it contains commas and the like - then use one of the Text::CSV modules. But if it is not - and that is often the case - I tend to just use split and join.
What is useful in your scenario is that you can easily map the key-value pairs of each record with a regex, and then output them with a hash slice:
#!/usr/bin/env perl
use strict;
use warnings;

#set paragraph mode - records are blank line separated.
local $/ = "";

my @rows;
my %seen_header;

#read STDIN or files on command line, just like sed/grep
while (<>) {
    #multi-line pattern that matches all the key-value pairs,
    #and inserts them into a hash.
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    push( @rows, \%this_row );

    #add the keys we've seen to a hash, so we 'know' what we've seen.
    $seen_header{$_}++ for keys %this_row;
}

#extract the keys, make them unique and ordered.
#could set this by hand if you prefer.
my @header = sort keys %seen_header;

#print the header row
print join( ",", @header ), "\n";

#iterate the rows
foreach my $row (@rows) {
    #use a hash slice to select the values matching @header.
    #the map is so any undefined values (missing keys) don't warn,
    #they just become blank fields.
    print join( ",", map { $_ // '' } @{$row}{@header} ), "\n";
}
Given your sample input, this produces:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
If you want to be really clever, most of that initial building loop can be done with:
my @rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;
The problem is that you also need to build the 'header' array, which means something a little more involved:
$seen_header{$_}++ for map { keys %$_ } @rows;
It works, but I don't think it is as clear about what is going on.
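Put together, a sketch of that 'clever' version as a complete script (with the same assumptions about the input format) might be:

#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "";    # paragraph mode

#build every record in one pass; the leading + forces an anonymous
#hash rather than a bare block inside map.
my @rows = map { +{ m/^(\w+):\.+ (.*)$/gm } } <>;

#collect every key seen in any record
my %seen_header;
$seen_header{$_}++ for map { keys %$_ } @rows;
my @header = sort keys %seen_header;

print join( ",", @header ), "\n";
foreach my $row (@rows) {
    print join( ",", map { $_ // '' } @{$row}{@header} ), "\n";
}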
However, the core of your problem is probably the file size - that is where you will run into trouble, because you would need to read the file twice: once to find out which headers exist across the whole file, and a second time to iterate and print:
#!/usr/bin/env perl
use strict;
use warnings;

open( my $input, '<', 'your_file.txt' ) or die $!;
local $/ = "";

#first pass - just collect the headers.
my %seen_header;
while (<$input>) {
    $seen_header{$_}++ for m/^(\w+):/gm;
}
my @header = sort keys %seen_header;

#print the header row
print join( ",", @header ), "\n";

#return to the start of file:
seek( $input, 0, 0 );

#second pass - print each record.
while (<$input>) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join( ",", map { $_ // '' } @this_row{@header} ), "\n";
}
This will be slightly slower, because it has to read the file twice. But it will not use anywhere near as much memory, because it does not hold the whole file in memory.
Unless you know all the keys in advance and can define them yourself, you will have to read the file twice.
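For completeness: if you do know every key up front, a single pass is enough. A rough sketch, with the key list hard-coded from your sample (adjust it to your real data):

#!/usr/bin/env perl
use strict;
use warnings;

#the full key list, known in advance (taken from the sample data)
my @header = qw( Key1 Key2 Key3 Key5 );

local $/ = "";    # paragraph mode

print join( ",", @header ), "\n";
while (<>) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join( ",", map { $_ // '' } @this_row{@header} ), "\n";
}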
Answer 2 (score: -1)
This seems to work with the data that you have provided:
use strict;
use warnings 'all';

my %data;

while (<>) {
    #capture the key before the colon, then skip the run of dots and
    #spaces, then capture the value (trimmed of trailing whitespace)
    next unless /^(\w+):\W*(.*\S)/;
    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;
The Data::Dump output is:

{
  Key1 => ["Value", "Different value"],
  Key2 => ["Other value"],
  Key3 => ["Maybe another value yet", "Invaluable"],
  Key5 => ["Has no value at all"],
}
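If you then want CSV out of that structure, one possible sketch is below - but note that it only pads the shorter columns with blanks, it does not remember which record each value came from, so values can end up on a different row than in the original dump:

use strict;
use warnings 'all';
use List::Util 'max';

#the structure produced above (hard-coded here so the sketch is self-contained)
my %data = (
    Key1 => [ "Value", "Different value" ],
    Key2 => [ "Other value" ],
    Key3 => [ "Maybe another value yet", "Invaluable" ],
    Key5 => [ "Has no value at all" ],
);

my @header = sort keys %data;
my $rows   = max map { scalar @{ $data{$_} } } @header;

print join( ",", @header ), "\n";
for my $i ( 0 .. $rows - 1 ) {
    print join( ",", map { $data{$_}[$i] // '' } @header ), "\n";
}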