Question

I want to parse a text file using Perl. The file contains log entries taken from some HTML files, as shown below:

Details from /projects/git/Changelog.html file:
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4208">JIRA-4208</a><span style='mso-spacerun:yes'>   </span>Add New Config C support in code
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Bugfix of some old bug
NEW_FEATURES: <a href="http://jira.xyz.com/browse/ZEERA-273">ZEERA-273</a><span style='mso-spacerun:yes'>   </span>Add support for some other feature.

Details from /projects/git/Changelog2.html file:
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-33">BUGJIRA-33</a><span style='mso-spacerun:yes'>   </span>Bugfix of an issue
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4209">JIRA-4209</a><span style='mso-spacerun:yes'>   </span>Add New Config D support in code

Each entry line contains a bug number and its description.

After parsing, the EXPECTED OUTPUT is:

JIRA-4208, BUGJIRA-31, ZEERA-273, BUGJIRA-33, JIRA-4209 : Add New Config C support in code, Bugfix of some old bug, Add support for some other feature, Bugfix of an issue, Add New Config D support in code

That is, all the bug numbers, followed by their descriptions.

If possible, I would like to write the output to another file, output.txt.

EDIT-1:

My code is as follows:

#!/usr/bin/perl
open (FILE, 'input_file1.txt') or die "Could not read from file, exit...";
while(<FILE>)
{
  chomp;
  ($junk0,$junk1,$junk2,$junk3,$junk4,$BUG_NUMBR) = split /[:<="">]+/,$_;
  print "$BUG_NUMBR \n";
}
close FILE;
exit;

Its output is completely different from the expected output shown above. I am unable to define a proper regular expression for the second part of the expected output, which is the short description of the bug.

Answer 1

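You don't need another regular expression. Your split pattern is unusual, but it does the job.

You can take the rest of the split result as well. I replaced your $junk variables with an array. Perl lets you use index -1 to get the last element of an array, so getting the text out is trivial, because it comes after the last >.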

use strict;
use warnings;

my ( @numbers, @text );
while (my $line = <DATA>) {
    chomp $line;
    my @stuff = split /[:<="">]+/, $line;
    push @numbers, $stuff[5];
    push @text, $stuff[-1];
}

print join ', ', @numbers;
print ' : ';
print join ', ', @text;

__DATA__
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4208">JIRA-4208</a><span style='mso-spacerun:yes'>   </span>Add New Config C support in code
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Bugfix of some old bug
NEW_FEATURES: <a href="http://jira.xyz.com/browse/ZEERA-273">ZEERA-273</a><span style='mso-spacerun:yes'>   </span>Add support for some other feature.
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-33">BUGJIRA-33</a><span style='mso-spacerun:yes'>   </span>Bugfix of an issue
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4209">JIRA-4209</a><span style='mso-spacerun:yes'>   </span>Add New Config D support in code

I've also added strict and warnings, and made your variables lexical.
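To see why index 5 and index -1 are the right ones, it can help to dump every field that the split produces for a single line. The following is only a small diagnostic sketch; it reuses one sample line from the data and the same split pattern, and nothing in it is needed for the actual solution.

```perl
use strict;
use warnings;

# One sample line copied from the data in this thread.
my $line = q{BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Bugfix of some old bug};

# Same split pattern as in the code above.
my @stuff = split /[:<="">]+/, $line;

# Print every field together with its index.
for my $i ( 0 .. $#stuff ) {
    printf "%2d: [%s]\n", $i, $stuff[$i];
}
```

For this line the list has 13 fields: field 5 is BUGJIRA-31 and the last field (index 12, or -1) is the description.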

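Also note that your code will break if the description contains a literal >, <, a quote, or any other character from the split pattern. It's a strange format, and an HTML parser won't really help you.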
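If that fragility is a concern, one possible alternative is a single match with two captures instead of a split. This is only a sketch: it assumes every entry has the shape `<a ...>KEY</a> ... </span>DESCRIPTION`, the helper name `parse_entry` is hypothetical, and the sample description is made up to include the problematic characters.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (key, description), or an empty list
# when the line does not look like an entry.
sub parse_entry {
    my ($line) = @_;
    # Assumed layout: ...<a href="...">KEY</a><span ...>...</span>DESCRIPTION
    # The greedy .* makes the match use the LAST </span> on the line.
    if ( $line =~ m{<a\s[^>]*>([^<]+)</a>.*</span>\s*(.*?)\s*\z} ) {
        return ( $1, $2 );
    }
    return;
}

# Made-up description containing >, quotes, a colon and an equals sign.
my ( $key, $text ) = parse_entry(
    q{BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Fix "a > b" check: use >=}
);
print "$key => $text\n";    # prints: BUGJIRA-31 => Fix "a > b" check: use >=
```

Because the helper returns an empty list for lines that do not match, header lines such as "Details from ..." and blank lines can be skipped by testing whether a key came back.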

Answer 2

The code for the problem described above is as follows:

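#!/usr/bin/perl

use strict;
use warnings;

# Three-argument open with a lexical filehandle
open my $fh, '<', 'perl_input_file1.txt' or die "Cannot open perl_input_file1.txt: $!";
my ( @numbers, @text );
while (my $line = <$fh>) {
    chomp $line;
    # Skip the "Details from ..." headers and blank separator lines
    next if $line =~ /^Details/ or $line !~ /\S/;
    my @stuff = split /[:<="">]+/, $line;
    push @numbers, $stuff[5];   # bug number, e.g. JIRA-4208
    push @text, $stuff[-1];     # description after the last '>'
}
close $fh;
print join ', ', @numbers;
print ': ';
print join ', ', @text;
print "\n";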

The output of this code is:

JIRA-4208, BUGJIRA-31, ZEERA-273, BUGJIRA-33, JIRA-4209: Add New Config C support in code, Bugfix of some old bug, Add support for some other feature, Bugfix of an issue, Add New Config D support in code

This is the same as the expected output mentioned in the question.
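The question also asked about writing the result to output.txt, which the code above does not do. A minimal sketch of that step follows; the two arrays here are shortened stand-ins for the ones filled by the parsing loop, and the file is assumed to be created in the current directory.

```perl
use strict;
use warnings;

# Stand-ins for the arrays built by the parsing loop.
my @numbers = ( 'JIRA-4208', 'BUGJIRA-31' );
my @text    = ( 'Add New Config C support in code', 'Bugfix of some old bug' );

# Build the summary line once so it can be sent to any filehandle.
my $summary = join( ', ', @numbers ) . ': ' . join( ', ', @text ) . "\n";

# '>' creates output.txt, or truncates it if it already exists.
open my $out, '>', 'output.txt' or die "Cannot write output.txt: $!";
print {$out} $summary;
close $out or die "Cannot close output.txt: $!";
```

Using '>>' instead of '>' would append to an existing file rather than overwrite it.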

Once again, I would like to thank @simbabque for the guidance and the approach.

Regards

Parse an HTML log file and get a text file in a specific format