I have a number of log files, for example cancel_log1, cancel_log2, ...
All of the files contain log entries like these:
2013/05/08 17:09:18 -0700 766 | 1368058158 | 22991 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^AMoney is tight. I would to keep the service but I don't have the money at this time. Maybe I can come back in the future.^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013
2013/05/07 17:45:35 -0700 219 | 1367973935 | 23388 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Aother^AYahoo China service close^Alifesig.com^AWeb Hosting^Akennethli2005^A05/10/2008^A05/07/2013
2013/05/08 17:30:57 -0700 115 | 1368059457 | 22982 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^A^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013
2013/05/07 17:59:38 -0700 694 | 1367974778 | 23381 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf244baidu.com^ADomains^Achuanqisf244baidu^A05/07/2013^A05/07/2013
2013/05/08 17:33:03 -0700 815 | 1368059583 | 23000 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Aretired^A^Asisterthrifty.com^ADomains^Atrinaboice^A08/09/2005^A05/08/2013
2013/05/07 17:59:40 -0700 231 | 1367974780 | 23389 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf239baidu.com^ADomains^Achuanqisf239baidu^A05/07/2013^A05/07/2013
I want to extract the fields separated by ^A and write them to a CSV file.
For example, my output file would look like this:
missing_feature chuanqisf239baidu.com Domains chuanqisf239baidu
Any help is appreciated.
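Answer 0 (score: 2)
You can easily split the line on ^A and then filter the resulting fields. I simply took the range of columns you said you were interested in and added some quoting logic before joining them with commas.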
use 5.010;    # enables say

while ( <> ) {
    say join( ',', map { index( $_, ',' ) > -1 ? qq/"$_"/ : $_ } @{[ split /\^A/ ]}[1..5] );
}
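Broken down into steps, it works like this.
I use the "diamond operator" because, if extracting the data is the main problem, there is no need for me to write the file-handling code for you. I use it as a generic input loop.
So we split the line like this: split /\^A/, which gives us a list.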
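Then we take a slice of that list by using a slice expression. If you have an array @a, then @a[2..4] is a way to extract only the elements you are interested in. So @{[ split /\^A/ ]} is an "array expression", and @{[ split /\^A/ ]}[1..5] is a slice of that array.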
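To make the slicing step concrete, here is a minimal sketch. The sample line is a shortened, made-up stand-in for the log format in the question, and the helper name wanted_fields is hypothetical. It also shows that a plain list slice, ( split ... )[1..5], should select the same elements as the array-reference form used above.

```perl
use strict;
use warnings;

# Hypothetical sample line, shortened from the log format in the question.
my $line = 'prefix Online^Amissing_feature^A^Aexample.com^ADomains^Asomeuser^A05/07/2013^A05/07/2013';

# Return fields 1..5 of a line split on the literal two-character sequence "^A".
sub wanted_fields {
    my ($text) = @_;
    return ( split /\^A/, $text )[ 1 .. 5 ];
}

# Two ways to take the same slice: a list slice and an array-reference slice.
my @via_list_slice  = wanted_fields($line);
my @via_array_slice = @{ [ split /\^A/, $line ] }[ 1 .. 5 ];

print join( '|', @via_list_slice ),  "\n";
print join( '|', @via_array_slice ), "\n";
```

Field 0 is the log prefix before the first ^A, which is why the slice starts at index 1.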
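But it is a list like any other, so we feed it to a map expression in which we check whether the field contains a comma. If it does, we wrap it in double quotes (qq/"$_"/); if not, we just return it as is.
Then we simply use join to insert a comma between the fields, and we say the resulting string.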
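However, the approach above is a poor way to produce CSV; it only does half the job. In real CSV output, if you quote a field you must also handle any quotes embedded in it.
So use Text::CSV instead.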
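To illustrate what handling embedded quotes involves, here is a minimal hand-rolled sketch in core Perl. It is not the Text::CSV API, and the helper names csv_field and csv_record are hypothetical. It assumes the common CSV convention of doubling any double quote inside a quoted field.

```perl
use strict;
use warnings;

# Quote one field CSV-style: wrap it in double quotes when it contains a
# comma, a double quote, or a line break, doubling any embedded quotes.
sub csv_field {
    my ($field) = @_;
    return $field unless $field =~ /[",\r\n]/;
    ( my $escaped = $field ) =~ s/"/""/g;
    return qq/"$escaped"/;
}

# Join a list of fields into one CSV record (without the trailing newline).
sub csv_record {
    return join ',', map { csv_field($_) } @_;
}

# Made-up fields, one of which contains both quotes and a comma.
print csv_record( 'other', 'He said "no", twice', 'example.com' ), "\n";
```

A dedicated module such as Text::CSV covers these cases and more, so a sketch like this is mainly useful for understanding the rules.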
Answer 1 (score: 1)
This simple program seems to do what you need. It expects the names of the input files as parameters on the command line.
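use strict;
use warnings;

# Matches a date field such as 05/08/2013, optionally followed by a newline.
my $date = qr|^[0-9]{2}/[0-9]{2}/[0-9]{4}\s*$|;

# Read from the files named on the command line.
while ( <> ) {
    my @fields = split /\^A/;
    shift @fields;    # drop the log prefix before the first ^A
    pop @fields while @fields and $fields[-1] =~ $date;    # drop trailing dates
    print join(',', @fields), "\n";
}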
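If your fields may contain commas, they need to be quoted, and you should replace the print line with this:
print join(',', map { /[",]/ ? '"' . s/"/""/gr . '"' : $_ } @fields), "\n";
This quotes any field that contains a comma or a double quote and escapes embedded quotes by doubling them, which is what most CSV readers expect. The /r modifier requires Perl 5.14 or later.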
Output
too_expensive,Money is tight. I would to keep the service but I don't have the money at this time. Maybe I can come back in the future.,securesanctuary.org,Web Hosting,securesanctuary
other,Yahoo China service close,lifesig.com,Web Hosting,kennethli2005
too_expensive,,securesanctuary.org,Web Hosting,securesanctuary
missing_feature,,chuanqisf244baidu.com,Domains,chuanqisf244baidu
retired,,sisterthrifty.com,Domains,trinaboice
missing_feature,,chuanqisf239baidu.com,Domains,chuanqisf239baidu
Answer 2 (score: 0)
Here is another option:
use strict;
use warnings;

my @words;
while (<>) {
    # Capture the text between pairs of ^A; print only if something matched.
    @words = /\^A(.+?)\^A/g and print +( join ',', @words ) . "\n";
}
Usage: perl script.pl inFile [>outFile]
The last, optional parameter directs the output to a file.
Output for the data set:
too_expensive,securesanctuary.org,securesanctuary
other,lifesig.com,kennethli2005
too_expensive,securesanctuary.org,securesanctuary
missing_feature,chuanqisf244baidu.com,chuanqisf244baidu
retired,sisterthrifty.com,trinaboice
missing_feature,chuanqisf239baidu.com,chuanqisf239baidu
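The script uses a regex to globally capture the text between ^A markers on each line, then joins the captures with "," before printing them.
The and is used as a short circuit, so print only runs when words were captured (no blank lines).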
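One thing worth checking with this pattern is that each match consumes both its opening and its closing ^A, so a field that directly follows a captured one is skipped; that appears to be why "Web Hosting" and "Domains" are missing from the output above. The sketch below compares the pattern with a variant that uses a lookahead. The variant, the helper names between_pairs and every_field, and the sample line are assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Capture text between pairs of "^A", as the script above does.
sub between_pairs {
    my ($line) = @_;
    return $line =~ /\^A(.+?)\^A/g;
}

# Hypothetical variant: a lookahead leaves the closing "^A" unconsumed,
# so the same "^A" can also open the next field.
sub every_field {
    my ($line) = @_;
    return $line =~ /\^A(.*?)(?=\^A)/g;
}

# Made-up line in the same shape as the log entries in the question.
my $line = 'x Online^Aretired^A^Aexample.com^ADomains^Asomeuser^A08/09/2005^A05/08/2013';

print join( ',', between_pairs($line) ), "\n";
print join( ',', every_field($line) ),   "\n";
```

Note that the lookahead variant also returns the empty comment field and the first date, and still drops the final field because nothing follows it, so further trimming would be needed to match the requested columns.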
Hope this helps!