Question

I have a lot of log files, e.g. cancel_log1, cancel_log2 ...

All of the files contain log entries like this:

2013/05/08 17:09:18 -0700 766 | 1368058158 | 22991 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^AMoney is tight. I would to keep the service but I don&#39;t have the money at this time. Maybe I can come back in the future.^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013    
2013/05/07 17:45:35 -0700 219 | 1367973935 | 23388 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Aother^AYahoo China service close^Alifesig.com^AWeb Hosting^Akennethli2005^A05/10/2008^A05/07/2013    
2013/05/08 17:30:57 -0700 115 | 1368059457 | 22982 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^A^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013    
2013/05/07 17:59:38 -0700 694 | 1367974778 | 23381 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf244baidu.com^ADomains^Achuanqisf244baidu^A05/07/2013^A05/07/2013    
2013/05/08 17:33:03 -0700 815 | 1368059583 | 23000 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Aretired^A^Asisterthrifty.com^ADomains^Atrinaboice^A08/09/2005^A05/08/2013    
2013/05/07 17:59:40 -0700 231 | 1367974780 | 23389 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf239baidu.com^ADomains^Achuanqisf239baidu^A05/07/2013^A05/07/2013

I want to extract the words separated by ^A and write them to a CSV file.

For example, my output file would look like this:

missing_feature chuanqisf239baidu.com Domains chuanqisf239baidu

Thanks for any help.

Answer 1

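You can easily split the fields on ^A and then filter the data. I just took the range of columns you said you were interested in, and added some quoting logic before joining them with commas.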

use feature 'say';    # say is not enabled by default

while ( <> ) {
    say join( ',', map { index( $_, ',' ) > -1 ? qq/"$_"/ : $_ } @{[ split /\^A/ ]}[1..5] );
}

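To break it down into more steps, it goes like this.

I use the "diamond operator" (<>) because, if extracting the data is the main problem, you don't need me to write the file-handling code for you. I use it as a generic input loop.
So we split the line like this: split /\^A/, which gives us a list.
Then we take a slice of that list by performing the operation inside a slice expression. If you have an array @a, @a[2..4] is a way to pull out only the elements you are interested in. So @{[ split /\^A/ ]} is an "array expression", and @{[ split /\^A/ ]}[1..5] is a slice of that array.
But it is a list like any other, so we put it in a map expression, where we check whether the field contains a comma. If it does, we wrap it in double quotes (qq/"$_"/); if not, we return it as is.
Then we just join the resulting strings, inserting a comma between each field.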
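The steps above can be sketched as separate statements. This is only an illustration: the sub name line_to_csv and the shortened sample line are made up here, not taken from the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one log line into a CSV row, one step at a time.
sub line_to_csv {
    my ($line) = @_;
    chomp $line;
    my @all    = split /\^A/, $line;    # 1. split on the literal characters ^A
    my @wanted = @all[ 1 .. 5 ];        # 2. slice out the fields of interest
    my @quoted = map {                  # 3. quote only fields that contain a comma
        index( $_, ',' ) > -1 ? qq/"$_"/ : $_
    } @wanted;
    return join ',', @quoted;           # 4. join the fields with commas
}

# Shortened sample line (the log prefix is abbreviated for illustration).
my $sample = 'x Online^Amissing_feature^A^Achuanqisf239baidu.com^ADomains^Achuanqisf239baidu^A05/07/2013^A05/07/2013';
print line_to_csv($sample), "\n";
```

The slice assumes every line has at least six ^A-separated fields; shorter lines would produce undefined values and warnings.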

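However, the above is a poor way of producing CSV; it only does half the job. In real CSV output, if you quote a field, you also have to handle any embedded quotes it might contain.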
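To make the embedded-quote problem concrete, here is a minimal sketch of the usual CSV convention: a field is wrapped in double quotes when it contains a comma, a double quote, or a line break, and each embedded double quote is doubled. The sub name quote_csv_field and the sample fields are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a single field following the common CSV convention.
sub quote_csv_field {
    my ($field) = @_;
    # Fields without special characters are returned unchanged.
    return $field unless $field =~ /[",\r\n]/;
    # Double every embedded double quote, then wrap the whole field in quotes.
    ( my $escaped = $field ) =~ s/"/""/g;
    return qq/"$escaped"/;
}

my @fields = ( 'too_expensive', 'He said "no", twice', 'example.org' );
print join( ',', map { quote_csv_field($_) } @fields ), "\n";
```

Hand-rolled quoting like this is easy to get subtly wrong, which is the argument for relying on a dedicated module.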

So use Text::CSV, i.e.:

say
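A minimal sketch of what Text::CSV usage could look like here, assuming the module is installed from CPAN (it is not part of core Perl). The sub name csv_row and the sample line are illustrative and not taken from the answer.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, assumed to be installed

my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

# Hypothetical helper: build one CSV row from fields 1..5 of a log line.
sub csv_row {
    my ($line) = @_;
    chomp $line;
    my @fields = ( split /\^A/, $line )[ 1 .. 5 ];
    # combine() applies the quoting and escaping rules; string() returns the row.
    $csv->combine(@fields) or die "combine failed";
    return $csv->string;
}

print csv_row('x Online^Aother^Asaid "no", twice^Ad.com^Ahosting^Auser^A01/01/2000'), "\n";
```

In a real script the same call would sit inside a while ( <> ) loop, with each returned row printed to the output file.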

Answer 2

This simple program seems to do what you need. It expects the names of the input files as arguments on the command line.

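use strict;
use warnings;

# Matches a field that is nothing but a date (MM/DD/YYYY), with optional trailing whitespace
my $date = qr|^[0-9]{2}/[0-9]{2}/[0-9]{4}\s*$|;

while ( <> ) {
  my @fields = split /\^A/;
  shift @fields;                                      # drop the log prefix before the first ^A
  pop @fields while @fields and $fields[-1] =~ $date; # drop the trailing date fields
  print join(',', @fields), "\n";
}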

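If your fields can contain commas then they need to be quoted, and you should replace the print line with this (the /r modifier requires Perl 5.14 or later)

print join(',', map { /,/ ? '"'.s/"/\\"/gr . '"' : $_ } @fields), "\n";

which quotes the fields that contain a comma and backslash-escapes any double quotes those fields contain.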

Output

too_expensive,Money is tight. I would to keep the service but I don&#39;t have the money at this time. Maybe I can come back in the future.,securesanctuary.org,Web Hosting,securesanctuary
other,Yahoo China service close,lifesig.com,Web Hosting,kennethli2005
too_expensive,,securesanctuary.org,Web Hosting,securesanctuary
missing_feature,,chuanqisf244baidu.com,Domains,chuanqisf244baidu
retired,,sisterthrifty.com,Domains,trinaboice
missing_feature,,chuanqisf239baidu.com,Domains,chuanqisf239baidu
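
The trimming step of this answer can also be isolated into a small sub, which makes it easy to check against a single line. This is just a sketch: the sub name feedback_fields is made up, and the sample line has its log prefix shortened.

```perl
use strict;
use warnings;

my $date = qr|^[0-9]{2}/[0-9]{2}/[0-9]{4}\s*$|;

# Hypothetical helper: return the ^A-separated fields of one log line,
# without the leading log prefix and without the trailing date fields.
sub feedback_fields {
    my ($line) = @_;
    my @fields = split /\^A/, $line;
    shift @fields;                                       # drop everything before the first ^A
    pop @fields while @fields and $fields[-1] =~ $date;  # drop trailing date-only fields
    return @fields;
}

my @f = feedback_fields("x Online^Aretired^A^Asisterthrifty.com^ADomains^Atrinaboice^A08/09/2005^A05/08/2013    \n");
print join( ',', @f ), "\n";
```

Because the date pattern allows trailing whitespace, the last field is recognised as a date even though it still carries the spaces and newline from the end of the line.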

Answer 3

Here's another option:

use strict;
use warnings;

my @words;
while (<>) {
    @words = /\^A(.+?)\^A/g and print +( join ',', @words ) . "\n";
}

Usage: perl script.pl inFile [>outFile]

The last, optional argument redirects the output to a file.

Output on your dataset:

too_expensive,securesanctuary.org,securesanctuary
other,lifesig.com,kennethli2005
too_expensive,securesanctuary.org,securesanctuary
missing_feature,chuanqisf244baidu.com,chuanqisf244baidu
retired,sisterthrifty.com,trinaboice
missing_feature,chuanqisf239baidu.com,chuanqisf239baidu

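The script uses a regex to globally capture the text between ^A delimiters on each line, then joins the captures with "," and prints the result.

and is used as a short circuit, so print only happens if words were captured (no blank lines).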
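
One thing to be aware of: the pattern /\^A(.+?)\^A/ consumes the closing ^A of each match, so that delimiter cannot open the next match, and .+? never matches an empty field. That is why the output above has three columns per line. If every field is wanted, a lookahead is one possible variant. The sketch below uses a made-up sub name, fields_between, and a sample line with a shortened prefix.

```perl
use strict;
use warnings;

# Hypothetical helper: capture every field that is followed by another ^A.
# The lookahead (?=\^A) checks for the closing delimiter without consuming it,
# so the same ^A can also open the next match. .*? allows empty fields.
sub fields_between {
    my ($line) = @_;
    return $line =~ /\^A(.*?)(?=\^A)/g;
}

my $line = 'x Online^Aretired^A^Asisterthrifty.com^ADomains^Atrinaboice^A08/09/2005^A05/08/2013';
print join( ',', fields_between($line) ), "\n";
```

Note that this variant also returns the first date field, since it too sits between two ^A delimiters, so it would still need to be trimmed if the dates are not wanted.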

Hope this helps!

Need to extract words before and after ^A from a file in Perl
