Converting malformed HTML to hierarchical XML

Date: 2018-08-23 18:04:55

Tags: xml perl

I am currently looking for a way, in Perl, to write the following output to an XML file:

  • h1 is the parent level
  • h2 is a child level of h1
  • h3 is a child level of h2 (or of h1), and so on.

Sample input

<h1>1 Top level heading
Para text 1
Para text 2
<h2>1.1 Sub level heading
Para text 3
Para text 4
<h3>1.1.1 Sub sub level heading
Para text 5
Para text 6
<h2>Sub level heading 2
Para text 7
Para text 8
<h1>Top level heading
Para text 1
Para text 2

Required output

<h1>
 <label>1</label>
 <title>Top level heading</title>
 <p>Para text 1</p>
 <p>Para text 2</p>

 <h2>
  <label>1.1</label>
  <title>Sub level heading</title>
  <p>Para text 3</p>
  <p>Para text 4</p>

  <h3>
    <label>1.1.1</label>
    <title>Sub sub level heading</title>
    <p>Para text 5</p>
    <p>Para text 6</p>
  </h3>
 </h2>

 <h2>Sub level heading (no number prefix)
  <p>Para text 7</p>
  <p>Para text 8</p>
 </h2>
</h1>

<h1>Top level heading (no number prefix)
<p>Para text 9</p>
<p>Para text 10</p>
</h1>

I have tried a lot, but I could not work out the logic to achieve this.

Can anyone help me get started?

Update

@Borodin's code works well on the input snippet above, but my actual requirement is as follows:

Input.txt

<art>Ärticle Title
<smry>1 Summåry
 Summary paragragh 1...
 Summary paragragh 2...
</smry>
<subjg>Subject Group Title
 subject 1; subject 2; subject 3
</subjg>

<h1>1 Top level heading
  Para text 1
  <img gr1.jpg>
  Para text 2

  <h2>1.1 Sub level heading
    Para text 3
    Para text 4
    <img gr2.jpg>

  <h2>1.2 Sub level heading
    Para text 5
    Para text 6

   <h3>1.1.1 Sub sub level heading
     Para text 7
     <fcap>Label 1: Text...
     <grp line1.png>
     Para text 8

   <h3>1.1.2 Sub sub level heading
     Para text 9
     Para text 10
  <h2>Sub level heading
    <fcap>Text only...
    <grp line2.png>
    Para text 11
    Para text 12

<h1>Top level heading
 Para text 13
 Para text 14

  <h2>Sub level heading
    Para text 15
    Para text 16

<blst>Books
 [1] Book name 1...
 [2] Book name 2...
 [3] Book name 3...
</blst>

<art>
...
<art>
...

Required Output.xml

<?xml version="1.0" encoding="UTF-8"?>
<article>
  <front>
    <title>&#x00C4;rticle Title</title>
    <summary>
      <label>1</label>
      <title>Summ&#x00E5;ry</title>
      <p>Summary paragragh 1...</p>
      <p>Summary paragragh 2...</p>
    </summary>
    <subj-group>
      <title>Subject Group Title</title>
      <sub>subject 1</sub>
      <sub>subject 2</sub>
      <sub>subject 3</sub>
    </subj-group>
  </front>
  <body>
    <h1 id="s1">
      <label>1</label>
      <title>Top level heading</title>
      <p>Para text 1</p>
      <img src="gr1.jpg" id="gr1"/>
      <p>Para text 2</p>
      <h2 id="s1a">
        <label>1.1</label>
        <title>Sub level heading</title>
        <p>Para text 3</p>
        <p>Para text 4</p>
        <img src="gr2.jpg" id="gr2"/>
      </h2>
      <h2 id="s1b">
        <label>1.2</label>
        <title>Sub level heading</title>
        <p>Para text 5</p>
        <p>Para text 6</p>
        <h3 id="s1b1">
          <label>1.1.1</label>
          <title>Sub sub level heading</title>
          <p>Para text 7</p>
          <figure id="grp1">
            <label>Label 1:</label>
            <cap><p>Text...</p></cap>
            <graphic src="line1.png"/>
          </figure>
          <p>Para text 8</p>
        </h3>
        <h3 id="s1b2">
          <label>1.1.2</label>
          <title>Sub sub level heading</title>
          <p>Para text 9</p>
          <p>Para text 10</p>
        </h3>
      </h2>
      <h2 id="s1c">
        <title>Sub level heading 2</title>
        <figure id="grp2">
          <cap><p>Text only...</p></cap>
          <graphic src="line2.png"/>
        </figure>
        <p>Para text 11</p>
        <p>Para text 12</p>
      </h2>
    </h1>
    <h1 id="s2">
      <title>Top level heading</title>
      <p>Para text 13</p>
      <p>Para text 14</p>
      <h2 id="s2a">
        <title>Sub level heading 2</title>
        <p>Para text 15</p>
        <p>Para text 16</p>
      </h2>
    </h1>
  </body>
  <back>
    <booklist>
      <title>Books</title>
      <bookname id="b1"><l>[1]</l><t>Book name 1...</t></bookname>
      <bookname id="b2"><l>[2]</l><t>Book name 2...</t></bookname>
      <bookname id="b3"><l>[3]</l><t>Book name 3...</t></bookname>
    </booklist>
  </back>
</article>

Can anyone help me?

2 answers:

Answer 0: (score: 2)

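Having said that this is difficult, I thought I should at least offer a solution!

I have added some comments, and I hope it is pretty much self-explanatory.

Note that it ignores all HTML tags other than <h1> etc., and I have made no attempt to add the blank lines you show, as there seems to be no logic behind them.

I wonder whether this is really what you want, as it is odd to put multiple paragraphs inside an <h1> element. Anyway, I hope this helps.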


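Note

I'm pretty sure this could be done with just a scalar count of the previous level. I started coding it that way but ended up using a stack because it helped my thinking. But because @stack only ever contains 1..3 etc., I think a count equivalent to the number of elements in @stack would do, incrementing and decrementing it in place of pushing to and popping from the array.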

use strict;
use warnings 'all';
use autodie;

# Read the file and split it on the header tags

my @blocks = do {
    open my $fh, '<', 'input.html';
    local $/;
    grep /\S/, split /(<h\d>)/, <$fh>;
};

my @stack;

while ( @blocks ) {

    my $tag  = shift @blocks;
    my $text = shift @blocks;
    my @text = split /\n/, $text;

    s/\A\s+|\s+\z//g for @text;  # Trim text lines

    die unless $tag =~ /h(\d+)/; # Check well-formed tag
    my $level = $1;              # and grab hierarchy level

    # Close all outstanding tags until we reach this level
    while ( @stack and $stack[-1] >= $level ) {
        my $l = $stack[-1];
        print indent($l-1), "</h$l>\n";
        pop @stack;
    }

    # Opening tag, on its own or with label and title if they're there
    if ( $text[0] =~ /^\b[\d.]+\b/ ) {

        print indent($level-1), $tag, "\n";

        my ($label, $title) = split ' ', shift(@text), 2;

        print indent($level), $_, "\n" for
                "<label>$label</label>",
                "<title>$title</title>";
    }
    else {
        print indent($level-1), $tag, shift @text, "\n";
    }

    # Print the remaining text lines as paragraphs                
    print indent($level), $_, "\n" for map { "<p>$_</p>" } @text;

    # Remember that this tag needs closing
    push @stack, $level;
}

# Close all outstanding tags
while ( @stack ) {
    my $l = $stack[-1];
    print indent($l-1), "</h$l>\n";
    pop @stack;
}


sub indent {
    my $n = shift;
    '  ' x $n;
}

Output

<h1>
  <label>1</label>
  <title>Top level heading</title>
  <p>Para text 1</p>
  <p>Para text 2</p>
  <h2>
    <label>1.1</label>
    <title>Sub level heading</title>
    <p>Para text 3</p>
    <p>Para text 4</p>
    <h3>
      <label>1.1.1</label>
      <title>Sub sub level heading</title>
      <p>Para text 5</p>
      <p>Para text 6</p>
    </h3>
  </h2>
  <h2>Sub level heading 2
    <p>Para text 7</p>
    <p>Para text 8</p>
  </h2>
</h1>
<h1>Top level heading
  <p>Para text 1</p>
  <p>Para text 2</p>
</h1>
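
The note above suggests replacing @stack with a plain counter. A minimal sketch of that idea follows; it is not part of the original answer. The helper name nest_events is hypothetical, and the sketch assumes heading levels never skip (an h3 only ever appears inside an h2), so the open levels are always 1 up to the current depth.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of heading levels into open/close events,
# tracking only a scalar depth instead of a stack of levels.
# Assumption: levels never skip, so the open elements are h1 .. h$depth.
sub nest_events {
    my @levels = @_;
    my $depth  = 0;    # number of currently open heading elements
    my @events;

    for my $level (@levels) {
        # Close every open element at this level or deeper
        while ( $depth >= $level ) {
            push @events, "</h$depth>";
            --$depth;
        }
        push @events, "<h$level>";
        $depth = $level;
    }

    # Close whatever is still open, innermost first
    while ( $depth > 0 ) {
        push @events, "</h$depth>";
        --$depth;
    }

    return @events;
}

# Heading levels from the sample input: h1 h2 h3 h2 h1
print join( ' ', nest_events( 1, 2, 3, 2, 1 ) ), "\n";
```

If the input can jump from h1 straight to h3, the assumption breaks and the stack of levels is the safer choice.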

Answer 1: (score: -1)

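There is no need to print the XML by hand, including handling the indentation. In my view, a simpler solution is to use a dedicated module such as XML::Writer.

Below is a proposed rework of the program by Borodin that produces its output using only XML::Writer.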

use strict; use warnings; use autodie; use XML::Writer;

my @stack;
my $wr = XML::Writer->new(OUTPUT => 'self', DATA_MODE => 1,
    DATA_INDENT => 2, UNSAFE => 1);

sub endTags {
    my $lev = shift;
    while (@stack and $stack[-1] >= $lev) {
        pop(@stack);
        $wr->endTag();
    }
}

my @blocks = do {
    open my $fh, '<', 'input.txt';
    local $/;   # Slurp mode
    grep /\S/, split /<(h\d)>/, <$fh>;
};
$wr->startTag('main');
push @stack, 0;         # Treat "main" as 0 level node
while (@blocks) {
    my $tag  = shift @blocks;   # Tag name
    my $text = shift @blocks;   # Content (up to the next <h...>)
    my @text = split /\n/, $text;
    s/\A\s+|\s+\z//g for @text;
    die unless $tag =~ /h(\d)/;
    my $level = $1;
    endTags($level);
    push @stack, $level;
    $wr->startTag($tag);
    if ($text[0] =~ /^\b[\d.]+\b/) {
        my ($label, $title) = split ' ', shift(@text), 2;
        $wr->dataElement(label => $label);
        $wr->dataElement(title => $title);
    } else {
        $wr->characters(shift(@text) . ' (no number prefix)');
    }
    $wr->dataElement('p' => $_) for @text;
}
endTags(0);
my $xml = $wr->end();
print $xml;

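As you can see, some fragments are the same (no need to reinvent the wheel), but, for example, the closing of XML tags has been moved to a dedicated function, which is called in two places.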

This program also complies with the rule for well-formed XML that a document must have a single root-level node (which I have called main here).

I had to set the UNSAFE option in XML::Writer, because otherwise it complains about mixed content (an element containing both text nodes and child elements).
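
A small sketch of that behaviour, assuming XML::Writer is installed: the helper mixed_heading (a hypothetical name, not from the answer) writes bare heading text followed by a <p> child, once with the default checks and once with UNSAFE set. As far as I understand the module, the first call should be rejected in data mode and the second should succeed.

```perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical helper: write a heading element holding bare text followed by
# a <p> child (mixed content). Returns the XML, or undef if the writer died.
sub mixed_heading {
    my %extra = @_;
    my $wr = XML::Writer->new( OUTPUT => 'self', DATA_MODE => 1, %extra );

    return eval {
        $wr->startTag('h1');
        $wr->characters('Top level heading');
        $wr->dataElement( p => 'Para text 1' );
        $wr->endTag();
        $wr->end();
    };
}

print defined mixed_heading()              ? "safe: ok\n"   : "safe: rejected\n";
print defined mixed_heading( UNSAFE => 1 ) ? "unsafe: ok\n" : "unsafe: rejected\n";
```

Note that UNSAFE switches off the writer's other well-formedness checks too, so an alternative is to wrap the heading text in its own element and keep the checks on.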

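A rather neat trick is that I also used the endTags function to close the main tag. This is possible because XML::Writer keeps track of the names of the tags the user has opened, so the endTag method does not actually need the name of the tag to close.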
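
To illustrate that last point, here is a minimal sketch, again assuming XML::Writer is available. The helper nested_doc is a hypothetical name: it opens the given elements in order and then closes all of them with bare endTag() calls, relying on the writer to remember what is open.

```perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical helper: open the given elements in order, then close them all
# with bare endTag() calls, and return the document as a string.
sub nested_doc {
    my @names = @_;
    my $wr = XML::Writer->new( OUTPUT => 'self' );

    $wr->startTag($_) for @names;
    $wr->endTag() for @names;    # no names passed: the writer knows what is open

    return $wr->end();
}

print nested_doc(qw/ main h1 h2 /), "\n";
```

Passing a name to endTag is still useful as a sanity check, since the writer then verifies that it matches the innermost open element.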