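Suppose we have a data file in the following format:
a:23 b:25 c:76 d:45
a:21 b:24 c:25
a:20 d:52 e:75 f:75 g:52
What is the fastest way to convert this data to CSV format, assuming the file is too large to read into memory?
The output should contain a header listing every "key" that occurs anywhere in the file; if a key is missing from a particular line, its value on that line should be zero. For example:
$ cat csv.txt
//a,b,c,d,e,f,g
23,25,76,45,0,0,0
21,24,25,0,0,0,0
20,0,0,52,75,75,52
...
(many lines)
...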
# transform_test.pl
# build set of all used keys.
my %usedKey;
open FILE, "data.txt";
while(<FILE>) {
    chomp $_;
    my @fields = split;
    foreach my $field (@fields) {
        my ($key,$value) = split(":",$field);
        $usedKey{$key} = 1;
    }
}
close FILE;
# build array of all used keys, but sorted.
my @sorted_keys = sort keys %usedKey;
# print header
my $header = "//";
foreach my $key (@sorted_keys) { $header .= "$key,"; }
chop $header;
print "$header\n";
# read through file again to transform the data;
open FILE, "data.txt";
while(<FILE>) {
    chomp $_;
    # build current line hash
    my @fields = split;
    my %currentData;
    foreach my $field (@fields) {
        my ($key,$value) = split(":",$field);
        $currentData{$key} = $value;
    }
    # build string by looping over all sorted keys.
    my $toPrint = "";
    foreach my $key (@sorted_keys) {
        $toPrint .= defined $currentData{$key} ? "$currentData{$key}," : "0,";
    }
    chop $toPrint;
    print "$toPrint\n";
}
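This is what I have tried. It works, but I feel that all the loops are slowing me down. Is there a faster, better-optimized way to do this? I have used Perl, but I am certainly willing to switch to Python or something else.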
Answer 0 (score: 3)
Well, going by your specification, this seems to do the trick:
#!/usr/bin/env perl
use strict;
use warnings 'all';
my @header = qw ( a b c d e f g h i j );
# Parenthesise join so that "\n" is not joined in as an extra field
print join( ",", @header ), "\n";
while ( <DATA> ) {
    # Turn "key:value key:value ..." into a hash for this row
    my %row = map { /(\w+):(\d+)/ } split;
    # Emit the values in header order, with 0 for any missing key
    print join( ",", map { $_ // 0 } @row{@header} ), "\n";
}
__DATA__
a:23 b:25 c:76 d:45
a:21 b:24 c:25
a:20 d:52 e:75 f:75 g:52
Output:
a,b,c,d,e,f,g,h,i,j
23,25,76,45,0,0,0,0,0,0
21,24,25,0,0,0,0,0,0,0
20,0,0,52,75,75,52,0,0,0
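That does rely on hard-coded keys, though. If you need dynamic keys, then it depends on how large your file is, because you need to process the data twice.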
#!/usr/bin/env perl
use strict;
use warnings 'all';
my %usedKeys;
my @rows;
# First pass: parse every row, keep it in memory, and record the keys seen
while (<DATA>) {
    my %row = map { /(\w+):(\d+)/ } split;
    push @rows, \%row;
    $usedKeys{$_}++ for keys %row;
}
my @header = sort keys %usedKeys;
# Parenthesise join so that "\n" is not joined in as an extra field
print join( ",", @header ), "\n";
# Second pass: walk the stored rows, with 0 for any missing key
foreach my $row (@rows) {
    print join( ",", map { $_ // 0 } @{$row}{@header} ), "\n";
}
__DATA__
a:23 b:25 c:76 d:45
a:21 b:24 c:25
a:20 d:52 e:75 f:75 g:52
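This slurps the whole data set into memory. Instead, you could make two passes through the file (as you have done), using the first pass to build the set of keys seen. That is much the same as what you already have; you just need to seek back to the start of the filehandle before beginning the second pass.
Unfortunately, because you cannot know in advance which keys you will see, there is no more efficient option than scanning the file twice and relying on the kernel's caching.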
E.g.:
# Scan the input, recording each row and every key seen
while (<DATA>) {
    my %row = map { /(\w+):(\d+)/ } split;
    push @rows, \%row;
    $usedKeys{$_}++ for keys %row;
}
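The two-pass idea described in this answer, with a seek back to the start of the filehandle between passes, might be sketched roughly as follows. This is only an illustration under stated assumptions: the sub names collect_keys and write_csv are hypothetical, the input is assumed to consist of whitespace-separated key:value fields with numeric values, and the script name in the usage comment is made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings 'all';

# Pass 1 (hypothetical helper): read the handle to the end and
# return the sorted list of every key that appears.
sub collect_keys {
    my ($fh) = @_;
    my %usedKeys;
    while ( my $line = <$fh> ) {
        $usedKeys{$_}++ for $line =~ /(\w+):\d+/g;
    }
    my @keys = sort keys %usedKeys;
    return @keys;
}

# Pass 2 (hypothetical helper): rewind the same handle, print the
# header, then print one CSV record per input line, using 0 for
# any key that is missing from that line.
sub write_csv {
    my ( $fh, $out, @header ) = @_;
    seek $fh, 0, 0 or die "Cannot rewind input: $!";
    print {$out} join( ",", @header ), "\n";
    while ( my $line = <$fh> ) {
        my %row = map { /(\w+):(\d+)/ } split ' ', $line;
        print {$out} join( ",", map { $_ // 0 } @row{@header} ), "\n";
    }
    return;
}

# Assumed usage: perl two_pass.pl data.txt > out.csv
# Runs only when an input path is given on the command line.
if (@ARGV) {
    my ($file) = @ARGV;
    open my $fh, '<', $file or die qq{Unable to open "$file": $!};
    my @header = collect_keys($fh);
    write_csv( $fh, \*STDOUT, @header );
    close $fh;
}
```

Only one row is held in memory at a time, so memory use grows with the number of distinct keys rather than with the size of the file. The cost is that the file is read twice.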
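Answer 1 (score: 1)
If the file is too large to fit into memory, then you need two passes: the first to build a list of all the column names, and the second to convert each line into the corresponding CSV record. I would write it like this.
This program expects the path to the input file as a parameter on the command line, and writes its output to STDOUT, which can be redirected on the command line.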
use strict;
use warnings 'all';
use Fcntl ':seek';
my ($file) = @ARGV;
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
# First pass: collect the column names in order of first appearance
my @heads;
{
    my %heads;
    while ( <$fh> ) {
        for my $val ( /([^\s:]+):/g ) {
            push @heads, $val unless $heads{$val}++;
        }
    }
}
print join(',', @heads), "\n";
# Rewind the file for the second pass
seek $fh, 0, SEEK_SET;
while ( <$fh> ) {
    # Alternating key and value tokens populate the hash directly
    my %values = /[^\s:]+/g;
    print join(',', map { $_ // 0 } @values{@heads}), "\n";
}
Output:
a,b,c,d,e,f,g
23,25,76,45,0,0,0
21,24,25,0,0,0,0
20,0,0,52,75,75,52