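I'm currently looking for a way, in Perl, to write the following output to an XML file, where h1 is the parent, h2 is a child of h1, h3 is a child of h2 (or of h1), and so on.
Input: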
<h1>1 Top level heading
Para text 1
Para text 2
<h2>1.1 Sub level heading
Para text 3
Para text 4
<h3>1.1.1 Sub sub level heading
Para text 5
Para text 6
<h2>Sub level heading 2
Para text 7
Para text 8
<h1>Top level heading
Para text 1
Para text 2
Required output:
<h1>
<label>1</label>
<title>Top level heading</title>
<p>Para text 1</p>
<p>Para text 2</p>
<h2>
<label>1.1</label>
<title>Sub level heading</title>
<p>Para text 3</p>
<p>Para text 4</p>
<h3>
<label>1.1.1</label>
<title>Sub sub level heading</title>
<p>Para text 5</p>
<p>Para text 6</p>
</h3>
</h2>
<h2>Sub level heading (no number prefix)
<p>Para text 7</p>
<p>Para text 8</p>
</h2>
</h1>
<h1>Top level heading (no number prefix)
<p>Para text 9</p>
<p>Para text 10</p>
</h1>
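I've tried a lot but couldn't work out the logic to achieve this.
Could someone help me get started?
@Borodin's code works well for the input snippet above, but my actual requirement is as follows: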
<art>Ärticle Title
<smry>1 Summåry
Summary paragragh 1...
Summary paragragh 2...
</smry>
<subjg>Subject Group Title
subject 1; subject 2; subject 3
</subjg>
<h1>1 Top level heading
Para text 1
<img gr1.jpg>
Para text 2
<h2>1.1 Sub level heading
Para text 3
Para text 4
<img gr2.jpg>
<h2>1.2 Sub level heading
Para text 5
Para text 6
<h3>1.1.1 Sub sub level heading
Para text 7
<fcap>Label 1: Text...
<grp line1.png>
Para text 8
<h3>1.1.2 Sub sub level heading
Para text 9
Para text 10
<h2>Sub level heading
<fcap>Text only...
<grp line2.png>
Para text 11
Para text 12
<h1>Top level heading
Para text 13
Para text 14
<h2>Sub level heading
Para text 15
Para text 16
<blst>Books
[1] Book name 1...
[2] Book name 2...
[3] Book name 3...
</blst>
<art>
...
<art>
...
Required output:
<?xml version="1.0" encoding="UTF-8"?>
<article>
<front>
<title>Ärticle Title</title>
<summary>
<label>1</label>
<title>Summåry</title>
<p>Summary paragragh 1...</p>
<p>Summary paragragh 2...</p>
</summary>
<subj-group>
<title>Subject Group Title</title>
<sub>subject 1</sub>
<sub>subject 2</sub>
<sub>subject 3</sub>
</subj-group>
</front>
<body>
<h1 id="s1">
<label>1</label>
<title>Top level heading</title>
<p>Para text 1</p>
<img src="gr1.jpg" id="gr1"/>
<p>Para text 2</p>
<h2 id="s1a">
<label>1.1</label>
<title>Sub level heading</title>
<p>Para text 3</p>
<p>Para text 4</p>
<img src="gr2.jpg" id="gr2"/>
</h2>
<h2 id="s1b">
<label>1.2</label>
<title>Sub level heading</title>
<p>Para text 5</p>
<p>Para text 6</p>
<h3 id="s1b1">
<label>1.1.1</label>
<title>Sub sub level heading</title>
<p>Para text 7</p>
<figure id="grp1">
<label>Label 1:</label>
<cap><p>Text...</p></cap>
<graphic src="line1.png"/>
</figure>
<p>Para text 8</p>
</h3>
<h3 id="s1b2">
<label>1.1.2</label>
<title>Sub sub level heading</title>
<p>Para text 9</p>
<p>Para text 10</p>
</h3>
</h2>
<h2 id="s1c">
<title>Sub level heading 2</title>
<figure id="grp2">
<cap><p>Text only...</p></cap>
<graphic src="line2.png"/>
</figure>
<p>Para text 11</p>
<p>Para text 12</p>
</h2>
</h1>
<h1 id="s2">
<title>Top level heading</title>
<p>Para text 13</p>
<p>Para text 14</p>
<h2 id="s2a">
<title>Sub level heading 2</title>
<p>Para text 15</p>
<p>Para text 16</p>
</h2>
</h1>
</body>
<back>
<booklist>
<title>Books</title>
<bookname id="b1"><l>[1]</l><t>Book name 1...</t></bookname>
<bookname id="b2"><l>[2]</l><t>Book name 2...</t></bookname>
<bookname id="b3"><l>[3]</l><t>Book name 3...</t></bookname>
</booklist>
</back>
</article>
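Can anyone help me with this?
Answer 0 (score: 2)
Having said this is hard, I thought I should at least offer a solution!
I've added some comments, and I hope the code is more or less self-explanatory.
Note that it ignores all HTML tags other than <h1> and so on, and I've made no attempt to add the blank lines you show, as there seems to be no logic behind them.
I wonder whether this is really what you want, as it is odd to put multiple paragraphs inside an <h1> element. Anyway, I hope this helps.
I'm pretty sure this could be done with just a scalar count of the preceding level. I started coding it that way but ended up using a stack because it helped my thinking. Since @stack only ever contains 1..3 and so on, I think it would work to keep a count equivalent to the number of elements in @stack, incrementing and decrementing it instead of pushing and popping the array.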
use strict;
use warnings 'all';
use autodie;
# Read the file and split it on the header tags
my @blocks = do {
open my $fh, '<', 'input.html';
local $/;
grep /\S/, split /(<h\d>)/, <$fh>;
};
my @stack;
while ( @blocks ) {
my $tag = shift @blocks;
my $text = shift @blocks;
my @text = split /\n/, $text;
s/\A\s+|\s+\z//g for @text; # Trim text lines
die unless $tag =~ /h(\d+)/; # Check well-formed tag
my $level = $1; # and grab hierarchy level
# Close all outstanding tags until we reach this level
while ( @stack and $stack[-1] >= $level ) {
my $l = $stack[-1];
print indent($l-1), "</h$l>\n";
pop @stack;
}
# Opening tag, on its own or with label and title if they're there
if ( $text[0] =~ /^\b[\d.]+\b/ ) {
print indent($level-1), $tag, "\n";
my ($label, $title) = split ' ', shift(@text), 2;
print indent($level), $_, "\n" for
"<label>$label</label>",
"<title>$title</title>";
}
else {
print indent($level-1), $tag, shift @text, "\n";
}
# Print the remaining text lines as paragraphs
print indent($level), $_, "\n" for map { "<p>$_</p>" } @text;
# Remember that this tag needs closing
push @stack, $level;
}
# Close all outstanding tags
while ( @stack ) {
my $l = $stack[-1];
print indent($l-1), "</h$l>\n";
pop @stack;
}
sub indent {
my $n = shift;
' ' x $n;
}
<h1>
<label>1</label>
<title>Top level heading</title>
<p>Para text 1</p>
<p>Para text 2</p>
<h2>
<label>1.1</label>
<title>Sub level heading</title>
<p>Para text 3</p>
<p>Para text 4</p>
<h3>
<label>1.1.1</label>
<title>Sub sub level heading</title>
<p>Para text 5</p>
<p>Para text 6</p>
</h3>
</h2>
<h2>Sub level heading 2
<p>Para text 7</p>
<p>Para text 8</p>
</h2>
</h1>
<h1>Top level heading
<p>Para text 1</p>
<p>Para text 2</p>
</h1>
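The stack handling in the program above can be isolated into a small function, which makes the closing rule easier to check on its own. The following is only a sketch: `nesting_events` is a hypothetical helper that is not part of the answer's code, and it returns the opening and closing tags for a list of heading levels instead of printing them. Keeping the levels on a stack also seems to cope with a skipped level (an h3 directly under an h1), where a plain depth count would no longer equal the heading level.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original answer): given heading levels in
# document order, return the sequence of opening and closing tags.
sub nesting_events {
    my @levels = @_;
    my ( @stack, @events );

    for my $level (@levels) {
        # Close every open heading at this level or deeper
        while ( @stack and $stack[-1] >= $level ) {
            push @events, '</h' . pop(@stack) . '>';
        }
        push @events, "<h$level>";
        push @stack,  $level;    # remember that this tag needs closing
    }

    # Close whatever is still open, innermost first
    while (@stack) {
        push @events, '</h' . pop(@stack) . '>';
    }
    return @events;
}

# Levels of the headings in the first sample input
print join( ' ', nesting_events( 1, 2, 3, 2, 1 ) ), "\n";
```

Note that the final loop pops from the end of the stack, so the innermost heading is closed first.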
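Answer 1 (score: -1)
There is no need to print the XML yourself, including handling the indentation. In my opinion, a simpler solution is to use a dedicated module such as XML::Writer.
Below is a proposed revision of Borodin's program, this time using XML::Writer.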
use strict; use warnings; use autodie; use XML::Writer;
my @stack;
my $wr = XML::Writer->new(OUTPUT => 'self', DATA_MODE => 1,
DATA_INDENT => 2, UNSAFE => 1);
sub endTags {
my $lev = shift;
while (@stack and $stack[-1] >= $lev) {
pop(@stack);
$wr->endTag();
}
}
my @blocks = do {
open my $fh, '<', 'input.txt';
local $/; # Slurp mode
grep /\S/, split /<(h\d)>/, <$fh>;
};
$wr->startTag('main');
push @stack, 0; # Treat "main" as 0 level node
while (@blocks) {
my $tag = shift @blocks; # Tag name
my $text = shift @blocks; # Content (up to the next <h...>)
my @text = split /\n/, $text;
s/\A\s+|\s+\z//g for @text;
die unless $tag =~ /h(\d)/;
my $level = $1;
endTags($level);
push @stack, $level;
$wr->startTag($tag);
if ($text[0] =~ /^\b[\d.]+\b/) {
my ($label, $title) = split ' ', shift(@text), 2;
$wr->dataElement(label => $label);
$wr->dataElement(title => $title);
} else {
$wr->characters(shift(@text) . ' (no number prefix)');
}
$wr->dataElement('p' => $_) for @text;
}
endTags(0);
my $xml = $wr->end();
print $xml;
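As you can see, some fragments are the same (no need to reinvent the wheel), but the closing of XML tags, for example, has been moved into a dedicated function that is called twice.
This program also complies with the XML well-formedness rule that a document must have a single root node (which I called main here).
I had to set the UNSAFE option in XML::Writer, because otherwise it complains about mixed content (elements that contain both text nodes and child elements).
A rather neat trick is that I also used the endTags function to close the main tag. This is possible because XML::Writer keeps track of the names of the tags the user has opened, so the endTag method does not actually need the name of the tag to close.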
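Neither program handles the extra line-level markup in the extended input from the question, such as `<img gr1.jpg>` lines and the `;`-separated subject list. A minimal sketch of those per-line conversions could look like the following. The helper names (`xml_escape`, `convert_line`, `convert_subjects`) are hypothetical, the output shape is inferred from the sample in the question, and the sketch uses plain string handling rather than XML::Writer so that it stays self-contained.

```perl
use strict;
use warnings;

# Hypothetical helpers; names and output shape are assumptions based on the
# sample output shown in the question.

# Escape the characters that are special in XML text and attribute values
sub xml_escape {
    my $s = shift;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Convert one trimmed body line: "<img gr1.jpg>" becomes an empty <img/>
# element whose id is the file name without its extension; any other line
# becomes a <p> element
sub convert_line {
    my $line = shift;
    if ( $line =~ /\A<img\s+([^\s>]+)>\z/ ) {
        my $src = $1;
        ( my $id = $src ) =~ s/\.[^.]+\z//;
        return sprintf '<img src="%s" id="%s"/>',
            xml_escape($src), xml_escape($id);
    }
    return '<p>' . xml_escape($line) . '</p>';
}

# Split "subject 1; subject 2" into one <sub> element per subject
sub convert_subjects {
    my $line = shift;
    return map { sprintf '<sub>%s</sub>', xml_escape($_) }
        grep { length } split /\s*;\s*/, $line;
}

print "$_\n"
    for convert_line('<img gr1.jpg>'),
        convert_line('Para text 1'),
        convert_subjects('subject 1; subject 2; subject 3');
```

With XML::Writer, the same decisions could be routed through its own methods instead of building strings, which would also take care of the escaping.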