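Need to extract words before and after ^A from a file in Perl

Time: 2013-05-10 10:46:54

Tags: perl extract

I have many log files, for example cancel_log1, cancel_log2, ...

All of the files contain log entries like this: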

2013/05/08 17:09:18 -0700 766 | 1368058158 | 22991 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^AMoney is tight. I would to keep the service but I don't have the money at this time. Maybe I can come back in the future.^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013    
2013/05/07 17:45:35 -0700 219 | 1367973935 | 23388 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Aother^AYahoo China service close^Alifesig.com^AWeb Hosting^Akennethli2005^A05/10/2008^A05/07/2013    
2013/05/08 17:30:57 -0700 115 | 1368059457 | 22982 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^A^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013    
2013/05/07 17:59:38 -0700 694 | 1367974778 | 23381 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf244baidu.com^ADomains^Achuanqisf244baidu^A05/07/2013^A05/07/2013    
2013/05/08 17:33:03 -0700 815 | 1368059583 | 23000 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Aretired^A^Asisterthrifty.com^ADomains^Atrinaboice^A08/09/2005^A05/08/2013    
2013/05/07 17:59:40 -0700 231 | 1367974780 | 23389 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf239baidu.com^ADomains^Achuanqisf239baidu^A05/07/2013^A05/07/2013    

I want to extract the words separated by ^A and write them to a CSV file.

For example, my output file would look like this:

missing_feature chuanqisf239baidu.com Domains chuanqisf239baidu

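Any help is appreciated.

3 Answers:

Answer 0: (score: 2)

You can easily split the fields on ^A and then filter the data. I simply took the range of columns you said you were interested in and added some quoting logic before joining them with commas.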

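use feature 'say';    # say is not enabled by default

while ( <> ) {
    # split on the literal "^A", keep fields 1..5, quote any field containing a comma
    say join( ',', map { index( $_, ',' ) > -1 ? qq/"$_"/ : $_ } @{[ split /\^A/ ]}[1..5] );
}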

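To break it down into more steps, it goes like this.

  1. I use the "diamond operator" because, if extracting the data is the main problem, there is no need for me to write the file-handling code for you. I use it for generic input loops.

  2. So we split the line like this: split /\^A/, which gives us a list.

  3. Then we take a slice of that list by doing the work inside a slice expression. If you have an array @a, then @a[2..4] is a way to extract just the elements you are interested in. So @{[ split /\^A/ ]} is an "array expression", and @{[ split /\^A/ ]}[1..5] is a slice of that array.

  4. But it is a list like any other, so we feed it to a map expression in which we check whether the field contains a comma. If it does, we wrap it in double quotes (qq/"$_"/); if not, we return it unchanged.

  5. Then we simply use join to insert a comma between each field, and we say the resulting string.

  6. However, the approach above is a poor way to produce CSV; it only does half the job. In real CSV output, if you quote a field you must also handle any embedded quotes.

    So use Text::CSV, i.e.: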

    say
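The snippet above is cut off in the source, so what follows is not the answer author's code. It is only a rough sketch of how the same five fields might be written with Text::CSV, assuming the module is installed from CPAN. The sub name line_to_csv and the sample line are made up for illustration. The combine and string methods build one CSV record in memory, and the module takes care of quoting and of doubling embedded quotes.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# binary => 1 allows arbitrary bytes inside fields
my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

# Hypothetical helper: turn one log line into one CSV record,
# keeping fields 1..5 after splitting on the literal "^A".
sub line_to_csv {
    my ($line) = @_;
    my @fields = ( split /\^A/, $line )[ 1 .. 5 ];
    # missing fields become empty strings instead of undef
    $csv->combine( map { $_ // '' } @fields )
        or die "combine failed: " . $csv->error_input;
    return $csv->string;
}

# Made-up sample line, modelled on the question's data
my $sample = 'line: 450 Online^Aother^AHe said "bye", then left^Alifesig.com'
           . '^AWeb Hosting^Akennethli2005^A05/10/2008^A05/07/2013';

print line_to_csv($sample), "\n";
```

With default settings Text::CSV may also quote fields that merely contain a space; if that is unwanted, the quote_space option can be set to 0 in the constructor.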

Answer 1: (score: 1)

This simple program seems to do what you need. It expects the names of the input files as arguments on the command line.

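use strict;
use warnings;

# Matches a MM/DD/YYYY date field, allowing trailing whitespace
my $date = qr|^[0-9]{2}/[0-9]{2}/[0-9]{4}\s*$|;

# Read the files named on the command line (or STDIN if none are given)
while ( <> ) {
  my @fields = split /\^A/;
  shift @fields;    # drop the log prefix before the first ^A
  pop @fields while @fields and $fields[-1] =~ $date;    # drop the trailing dates
  print join(',', @fields), "\n";
}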

If your fields may contain commas, they need to be quoted, and you should replace the print line with this:

print join(',', map { /,/ ? '"'.s/"/\\"/gr . '"' : $_ } @fields), "\n";

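This quotes the fields that contain commas and escapes any quotes those fields may contain. Note that the /r modifier on s/// requires Perl 5.14 or later.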

Output

too_expensive,Money is tight. I would to keep the service but I don't have the money at this time. Maybe I can come back in the future.,securesanctuary.org,Web Hosting,securesanctuary
other,Yahoo China service close,lifesig.com,Web Hosting,kennethli2005
too_expensive,,securesanctuary.org,Web Hosting,securesanctuary
missing_feature,,chuanqisf244baidu.com,Domains,chuanqisf244baidu
retired,,sisterthrifty.com,Domains,trinaboice
missing_feature,,chuanqisf239baidu.com,Domains,chuanqisf239baidu
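One thing worth checking: the code in this thread assumes that ^A appears in the files as two literal characters, a caret followed by A. Many viewers (cat -v, less, vi) display the single control character Ctrl-A (byte 0x01) as ^A, and the question does not say which one the logs really contain. Here is a small sketch, under that uncertainty, of a separator pattern that accepts either form. The names $sep and split_fields are just for illustration.

```perl
use strict;
use warnings;

# Matches either the two literal characters "^A" or the single
# control character Ctrl-A (byte 0x01).
my $sep = qr/\^A|\x01/;

# Hypothetical helper: split one log line and drop the log prefix.
sub split_fields {
    my ($line) = @_;
    chomp $line;
    my @fields = split $sep, $line, -1;    # -1 keeps trailing empty fields
    shift @fields;                         # drop everything before the first separator
    return @fields;
}

# Both forms give the same fields
print join( ',', split_fields("Online\x01retired\x01\x01sisterthrifty.com") ), "\n";
print join( ',', split_fields('Online^Aretired^A^Asisterthrifty.com') ), "\n";
```

If only one form ever occurs in your files, the simpler split /\^A/ or split /\x01/ is enough.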

Answer 2: (score: 0)

Here is another option:

use strict;
use warnings;

my @words;
while (<>) {
    @words = /\^A(.+?)\^A/g and print +( join ',', @words ) . "\n";
}

Usage: perl script.pl inFile [>outFile]

The last, optional argument redirects the output to a file.

Output on the data set:

too_expensive,securesanctuary.org,securesanctuary
other,lifesig.com,kennethli2005
too_expensive,securesanctuary.org,securesanctuary
missing_feature,chuanqisf244baidu.com,chuanqisf244baidu
retired,sisterthrifty.com,trinaboice
missing_feature,chuanqisf239baidu.com,chuanqisf239baidu

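The script uses a regex to globally capture the text between ^A delimiters on each line, then joins the captures with "," before printing the result.

The and is used as a short circuit, so the print only happens when words were captured (no blank lines).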
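Note that in the regex above each match consumes its closing ^A, so that delimiter cannot also open the next match; this is why the output lists only every other field. If you want every field that sits between two delimiters, one possible variant (a sketch, not part of the original answer) uses a lookahead for the closing ^A and .*? so that empty fields are kept. The sub name fields_between is made up for illustration.

```perl
use strict;
use warnings;

# Return every field that is both preceded and followed by a literal "^A".
# The lookahead (?=\^A) checks for the closing delimiter without consuming
# it, so the same delimiter can open the next match.
sub fields_between {
    my ($line) = @_;
    return $line =~ /\^A(.*?)(?=\^A)/g;
}

my $sample = 'Online^Aretired^A^Asisterthrifty.com^ADomains^Atrinaboice^A08/09/2005^A05/08/2013';
print join( ',', fields_between($sample) ), "\n";
# prints: retired,,sisterthrifty.com,Domains,trinaboice,08/09/2005
```

The last date is not printed because nothing follows it; for this data a plain split, as in the other answers, is probably the simpler tool.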

Hope this helps!