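Parsing an HTML log file and getting a text file in a specific format

Time: 2017-05-09 09:54:39

Tags: perl

I want to parse a text file using Perl. This text file contains a log of some HTML files, and looks like this: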

Details from /projects/git/Changelog.html file:
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4208">JIRA-4208</a><span style='mso-spacerun:yes'>   </span>Add New Config C support in code
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Bugfix of some old bug
NEW_FEATURES: <a href="http://jira.xyz.com/browse/ZEERA-273">ZEERA-273</a><span style='mso-spacerun:yes'>   </span>Add support for some other feature.

Details from /projects/git/Changelog2.html file:
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-33">BUGJIRA-33</a><span style='mso-spacerun:yes'>   </span>Bugfix of an issue
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4209">JIRA-4209</a><span style='mso-spacerun:yes'>   </span>Add New Config D support in code

Each line contains a bug number and its description.

After parsing, the EXPECTED OUTPUT is as follows:

JIRA-4208, BUGJIRA-31, ZEERA-273, BUGJIRA-33, JIRA-4209 : Add New Config C support in code, Bugfix of some old bug, Add support for some other feature, Bugfix of an issue, Add New Config D support in code

That is, all the bug numbers followed by their descriptions.

If possible, I would like to write the output to another file, output.txt.

EDIT-1:

My code is as follows:

#!/usr/bin/perl
open (FILE, 'input_file1.txt') or die "Could not read from file, exit...";
while(<FILE>)
{
  chomp;
  ($junk0,$junk1,$junk2,$junk3,$junk4,$BUG_NUMBR) = split /[:<="">]+/,$_;
  print "$BUG_NUMBR \n";
}
close FILE;
exit;

Its output is completely different from the expected output shown above. I have not been able to define a proper regex for the second part of the expected output, which is the short description of each bug.

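2 answers:

Answer 0: (score: 0)

You don't need another regex. Your split pattern is unusual, but it gets the job done.

Just take the rest of the split result as well. I replaced your $junk variables with an array. Perl lets you address the last element of an array with index -1, so getting the text out is trivial, because it is whatever follows the last >.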

use strict;
use warnings;

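my ( @numbers, @text );
while (my $line = <DATA>) {
    chomp $line;
    # Split on runs of : < = " > so that the markup falls away.
    my @stuff = split /[:<="">]+/, $line;
    push @numbers, $stuff[5];     # sixth field: the bug number (link text)
    push @text, $stuff[-1];       # last field: the description
}

# Print "numbers : descriptions" on a single line.
print join ', ', @numbers;
print ' : ';
print join ', ', @text;
print "\n";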

__DATA__
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4208">JIRA-4208</a><span style='mso-spacerun:yes'>   </span>Add New Config C support in code
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Bugfix of some old bug
NEW_FEATURES: <a href="http://jira.xyz.com/browse/ZEERA-273">ZEERA-273</a><span style='mso-spacerun:yes'>   </span>Add support for some other feature.
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-33">BUGJIRA-33</a><span style='mso-spacerun:yes'>   </span>Bugfix of an issue
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4209">JIRA-4209</a><span style='mso-spacerun:yes'>   </span>Add New Config D support in code

I also added strict and warnings, and made your variables lexical.
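To see why index 5 holds the bug number and index -1 holds the description, it may help to print every field that split produces for a single line. This is only an illustrative sketch using one of the sample lines from the question; the printf loop is an addition and not part of the answer's code.

```perl
use strict;
use warnings;

# One sample line taken from the question's log.
my $line = q{BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Bugfix of some old bug};

# Split on runs of the characters : < = " >
my @stuff = split /[:<="">]+/, $line;

# Show each field next to its index.
printf "%2d: [%s]\n", $_, $stuff[$_] for 0 .. $#stuff;

print "bug number:  $stuff[5]\n";
print "description: $stuff[-1]\n";
```

The fixed index 5 only works as long as every line has the same markup in front of the link text, whereas -1 keeps working as long as the description is the final field.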

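Also note that your code will break if the description itself contains any of the split characters, that is a literal >, <, :, = or a double quote. This is a weird format, and an HTML parser would not really help you with it.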
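One way to reduce that fragility, offered here only as a sketch, is to anchor a single regex on the markup and capture the link text and the rest of the line. The helper name parse_entry and the second sample line (BUGJIRA-99, with a description containing split characters) are hypothetical and do not come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (bug number, description) for a log entry,
# or an empty list for lines that do not look like one.
sub parse_entry {
    my ($line) = @_;
    return $line =~ m{
        <a\s[^>]*>                     # opening anchor tag
        ([^<]+)                        # link text: the bug number
        </a>
        (?:<span[^>]*>[^<]*</span>)?   # optional spacer span
        \s*(.*\S)                      # description: rest of the line
    }x ? ( $1, $2 ) : ();
}

my @lines = (
    q{NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4208">JIRA-4208</a><span style='mso-spacerun:yes'>   </span>Add New Config C support in code},
    # Hypothetical entry whose description contains split characters.
    q{BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-99">BUGJIRA-99</a><span style='mso-spacerun:yes'>   </span>Handle "a > b" case: fix},
    q{Details from /projects/git/Changelog.html file:},
);

my ( @numbers, @text );
for my $line (@lines) {
    # Lines without an anchor tag (headers, blanks) are skipped.
    my ( $number, $desc ) = parse_entry($line) or next;
    push @numbers, $number;
    push @text,    $desc;
}

print join( ', ', @numbers ), ' : ', join( ', ', @text ), "\n";
```

Because the description is taken as everything after the markup, characters such as : or > inside it no longer split it apart. The pattern still assumes the anchor-then-span layout shown in the question.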

Answer 1: (score: 0)

The code for the problem described above is as follows:

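#!/usr/bin/perl

use strict;
use warnings;

# Three-argument open with a lexical filehandle.
my $file = 'perl_input_file1.txt';
open my $fh, '<', $file or die "Cannot open $file: $!";

my ( @numbers, @text );
while (my $line = <$fh>) {
    chomp $line;
    # Skip the "Details from ..." headers and blank separator lines.
    next if $line =~ /^Details/ or $line !~ /\S/;
    my @stuff = split /[:<="">]+/, $line;
    push @numbers, $stuff[5];    # bug number
    push @text, $stuff[-1];      # description
}
close $fh;

print join ', ', @numbers;
print ': ';
print join ', ', @text;
print "\n";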

The output of this code is:

JIRA-4208, BUGJIRA-31, ZEERA-273, BUGJIRA-33, JIRA-4209: Add New Config C support in code, Bugfix of some old bug, Add support for some other feature, Bugfix of an issue, Add New Config D support in code

This matches the expected output described in the question.
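The question also asked about writing the result to another file, output.txt. A minimal sketch of that step follows; the two short arrays are stand-ins for the @numbers and @text arrays built by the parsing loop, and the variable names $out_file and $out are assumptions made for this example.

```perl
use strict;
use warnings;

# Stand-ins for the arrays filled by the parsing loop.
my @numbers = ( 'JIRA-4208', 'BUGJIRA-31' );
my @text    = ( 'Add New Config C support in code', 'Bugfix of some old bug' );

# File name taken from the question; '>' truncates any existing file.
my $out_file = 'output.txt';
open my $out, '>', $out_file or die "Cannot write $out_file: $!";

# Same single-line layout as the expected output in the question.
print {$out} join( ', ', @numbers ), ' : ', join( ', ', @text ), "\n";

close $out or die "Cannot close $out_file: $!";
```

Printing to the lexical handle instead of STDOUT is the only change needed; redirecting the script's output in the shell would be an alternative that leaves the code untouched.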

I would like to thank @simbabque again for the guidance and the approach.

Regards