I have a question. I want to write a Perl script to parse Mailgun output into CSV format. I think the 'split' and 'join' functions are suited to this process. Here is some sample data:
Sample data
{
  "geolocation": {
    "city": "Random City",
    "region": "State",
    "country": "US"
  },
  "url": "https://www4.website.com/register/1234567",
  "timestamp": "1237854980723.0239847"
}
{
  "geolocation": {
    "city": "Random City2",
    "region": "State2",
    "country": "mEXICO"
  },
  "url": "https://www4.website2.com/register/ABCDE567",
  "timestamp": "1237854980723.0239847"
}
Desired output
"city","region","country","url","timestamp"
"Random City","State","US","https://www4.website.com/register/1234567","1237854980723.0239847"
"Random City2","State2","mEXICO","https://www4.website2.com/register/ABCDE567","1237854980723.0239847"
My goal is to take my sample data and produce the desired output as a comma-delimited CSV file. I'm not sure how to approach this. Normally I would hack at it with a series of one-liners in a batch file, but I would prefer a Perl script. The real data will contain more information, but just figuring out how to parse the general structure would be enough.
Here is what I have in my batch file.
Code
perl -p -i.bak -e "s/(,$|,+ +$|^.*?{$|^.*?}.*?$|^.*?],.*?$)//gi" file.txt
rem Removes all unnecessary characters and lines with { and }.
perl -p -i.bak -e "s/(^ +| +$)//gi" file.txt
perl -p -i.bak -e "s/^\n$//gi" file.txt
rem Removes all blank lines in initial file. Next one-liner takes care of trailing and beginning
rem whitespace. The file is nice and clean now.
perl -p -e "s/(^\".*?\"):.*?$/$1/gi" file.txt > header.txt
rem retains only header info and puts it into 'header.txt'
perl -p -e "s/^\".*?\": +(\".*?\"$)/$1/gi" file.txt > data.txt
rem retains only data that is associated with each field.
perl -p -i.bak -e "s/\n/,/gi" data.txt
rem replaces new line character with ',' delimiter.
perl -p -i.bak -e "s/^/\n/gi" data.txt
rem drops data down a line
perl -p -i.bak -e "s/\n/,/gi" header.txt
rem replaces new line character with ',' delimiter.
copy header.txt+data.txt report.txt
rem copies both files together. Since there is the same amount of fields as there are data
rem delimiters, the columns and headers match.
My output
"city","region","country","url","timestamp"
"Random City","State","US","https://www4.website.com/register/1234567",1237854980723.0239847
This does the trick, but a leaner script would be better. Varying conditions will break this batch script, so I need something more solid. Any suggestions?
Answer 0 (score: 1)
You can use a single Perl script with a single regex:
#!/usr/bin/env perl
use v5.10;
use Data::Dumper;
$_ = <<TXT;
{
"geolocation": {
"city": "Random City",
"region": "State",
"country": "US"
},
"url": "https://www4.website.com/register/1234567",
"timestamp": "1237854980723.0239847"
}
TXT
my @matches = /("[^"]+")\s*:\s*("[^"]+")/gmx;
my %hash = @matches;
say join(",", keys %hash);
say join(",", values %hash);
Which outputs:
"city","country","region","timestamp","url"
"Random City","US","State","1237854980723.0239847","https://www4.website.com/register/1234567"
Of course, if you want to read from STDIN instead, replace the string definition with:
local $/ = undef;
$_ = <>;
If you need more robust code, I suggest first matching the blocks of data enclosed in braces, then searching those blocks for key:value pairs.
I would write this program.pl file:
#!/usr/bin/env perl
use v5.10;
use Data::Dumper;
local $/ = undef;
open my $fh, '<', $ARGV[0] or die $!;
$_ = <$fh>;
close $fh;
# Match all group { ... }
my @groups = /((?&BRACKETED))
(?(DEFINE)
(?<WORD> [^\{\}]+ )
(?<BRACKETED> \s* \{ (?&TEXT)? \s* \} )
(?<TEXT> (?: (?&WORD) | (?&BRACKETED) )+ )
)/gmx;
# Match any key:value pairs inside each group
my @results;
for(grep($_,@groups)) {
push @results, {/"([^"]+)"\s*:\s*("[^"]+")/gmx};
}
# For each result, we print the keys we want
for(@results) {
say join ",", @$_{qw/city region country url timestamp/};
}
Then a batch file to call the script:
rem How to call it...
@perl program.pl text.txt > report.txt
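Assuming text.txt holds the two sample records from the question, report.txt should then contain one CSV row per brace-delimited block, in the column order fixed by the qw/…/ slice (note the script prints no header row):

```
"Random City","State","US","https://www4.website.com/register/1234567","1237854980723.0239847"
"Random City2","State2","mEXICO","https://www4.website2.com/register/ABCDE567","1237854980723.0239847"
```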
Answer 1 (score: 0)
No disrespect to @coin's regex-fu, but the advantages of using CPAN modules include getting a more flexible solution that you can build on, and taking advantage of edge-case handling that other people have already worked out.
This solution uses the JSON module to parse your incoming data (I'm assuming it really is JSON) and the Text::CSV_XS module to produce high-quality CSV, which handles things like embedded quotes and commas in the data.
use warnings;
use strict;
use JSON qw/decode_json/;
use Text::CSV_XS;
my $json_data_as_string = <<EOL;
{
"geolocation": {
"city": "Random City",
"region": "State",
"country": "US"
},
"url": "https://www4.website.com/register/1234567",
"timestamp": "1237854980723.0239847"
}
EOL
my $s = decode_json($json_data_as_string);
my $csv = Text::CSV_XS->new({ binary => 1 });
$csv->combine(
$s->{geolocation}{city},
$s->{geolocation}{region},
$s->{geolocation}{country},
$s->{url},
$s->{timestamp},
) || die $csv->error_diag;
print $csv->string, "\n";
To read the data from a file into $json_data_as_string, you can use the code from @coin's solution.