Question

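I'm new to Perl and recently wrote a converter for our SharePoint. It basically takes the HTML pages of our old wiki and converts them to aspx pages with SP classes and so on.

Everything worked fine until someone used <tags> as text. Here is an HTML example from the old TWiki: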

<li> Moduldateinamen haben folgendes Format <code> <Content>_<Type>_<Name>_</code> ...

So <Content>, <Type> and <Name> are text inside the <code> element.

How it looks in the old wiki:

[image: How it looks in old wiki]

How it looks after converting to aspx and uploading to SharePoint:

[image: How it looks after converting to aspx and uploaded to SharePoint]

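As you can see, SP tries to interpret them as tags (of course) rather than as text, so they are not displayed.

For SharePoint pages I need the HTML markup between the SP ASPX tags to be escaped. So I change, for example, < to &lt; via regex, and so on.

However, the example snippet I posted should look like this in the ASPX: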

&lt;li&gt; Moduldateinamen haben folgendes Format &lt;code&gt; openTagContentclosingTag_openTagTypeclosingTag_openTagNameclosingTag_&lt;/code&gt;

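So inside the text, &lt; is converted to openTag and &gt; to closingTag, but only for the actual content between the <li> tags. Those placeholders have to be changed by hand later (I don't see another way).

How can I make sure that only the "text" tags are escaped with openTag / closingTag, while the "real" HTML tags are escaped the usual way, < to &lt;?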

Answer 1

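I fiddled around and found a solution that might work. It works with the sample data, but I don't know how complex your actual documents are.

Consider this sample input.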

my $html = <<'HTML';
<ul>
    <li><code>tecdata\de\modules:</code> Testbausteine blafasel</li>
    <li>
        <ul>
            <li> Moduldateinamen haben folgendes Format <code> <Content>_<Type>_<Name></code> (Ausnahme: <code>ti_<Name>)</li>
            <li>
                <ul>
                    <li><Content> bezeichnet den semantischen Inhalt</li>
                    <li><code>ti_</code> diese ganzen Listen sind verwirrend
                </ul>
            </li>
        </ul>
    </li>
<p>And more stuff here...</p>";
HTML

And the following program.

# we will save the tag-looking words here
my %non_tags;

# (8) replace html with the concatenated result
$html = join '', map {
    my $string = $_;

    # (2) find where the end-tag is
    my $pos = index($string, '</code>');
    if ($pos >= 0) {
        # (3) take string until the end-tag
        my $escaped = substr( $string, 0, $pos );

        # (4) remember the tag-looking words
        $non_tags{$_}++ foreach $escaped =~ m/<([^>]+)>/g;

        # (5) html-escape the <>
        $escaped =~ s/</&lt;/g;
        $escaped =~ s/>/&gt;/g;

        # (6) overwrite the not-escaped part with the newly escaped string
        substr( $string, 0, $pos ) = $escaped;
    }
    # (7) hand the (possibly modified) chunk back to map
    $string;
} split m/(<code>)/, $html; # (1) split on <code> but also take that delimiter

# html-escape those tag-looking words all over the text;
# \Q...\E keeps any regex metacharacters in a word literal
foreach my $word ( keys %non_tags) {
    $html =~ s/<(\Q$word\E)>/&lt;$1&gt;/g ;
}

print $html;

This is the output.

<ul>
    <li><code>tecdata\de\modules:</code> Testbausteine blafasel</li>
    <li>
        <ul>
            <li> Moduldateinamen haben folgendes Format <code> &lt;Content&gt;_&lt;Type&gt;_&lt;Name&gt;</code> (Ausnahme: <code>ti_&lt;Name&gt;)</li>
            <li>
                <ul>
                    <li>&lt;Content&gt; bezeichnet den semantischen Inhalt</li>
                    <li><code>ti_</code> diese ganzen Listen sind verwirrend
                </ul>
            </li>
        </ul>
    </li>
<p>And more stuff here...</p>";

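As you can see, it HTML-escapes all the tag-looking words inside the <code></code> elements. It also remembers what those words were and then replaces any further occurrences of them elsewhere in the text. That way, we don't mess with the actual HTML.

This is a very naive approach, but since this is a one-off task, a crappy solution that works is better than no solution.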
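
If you want the openTag / closingTag placeholders from the question instead of entities, the same idea might be sketched in two passes. This is not part of the program above: the sub name escape_for_aspx is made up for illustration, and it assumes every <code> element is properly closed, which is not true for all of the sample input.

```perl
use strict;
use warnings;

# Hypothetical helper; the name and the two-pass design are assumptions.
sub escape_for_aspx {
    my ($html) = @_;

    # Pass 1: inside <code>...</code>, turn text "tags" into placeholders.
    $html =~ s{(<code>)(.*?)(</code>)}{
        my ($open, $text, $close) = ($1, $2, $3);
        $text =~ s/</openTag/g;
        $text =~ s/>/closingTag/g;
        "$open$text$close";
    }gse;

    # Pass 2: escape the angle brackets that are left (the real markup).
    $html =~ s/</&lt;/g;
    $html =~ s/>/&gt;/g;

    return $html;
}

my $line = '<li> Moduldateinamen haben folgendes Format <code> <Content>_<Type>_<Name>_</code>';
print escape_for_aspx($line), "\n";
```

For the line from the question this prints the real tags as entities and the three text tags as openTag...closingTag placeholders. Tag-looking words outside <code>, like the <Content> list item in the sample, would still need the %non_tags bookkeeping from the program above.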

Answer 2

If I understand the question correctly, all you need is a single regular expression:

$page =~ s{(?<=<code>)(.+?)(?=<\/code>)}
          {
              my $text = $1;
              $text =~ s/([<>])/ $1 eq '<' ? '&lt;': '&gt;'/ge;
              $text;
          }gxe;
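
To try the substitution above on its own, here is a small self-contained run against the line from the question. The /s flag is my addition (an assumption about your data) so that "." also matches newlines inside a <code> element. Be aware that with /s an unclosed <code> would run on to the next </code> further down the page.

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $page = '<li> Moduldateinamen haben folgendes Format <code> <Content>_<Type>_<Name>_</code> ...';

# Escape < and > only between <code> and </code>; the lookarounds keep
# the <code> tags themselves out of the match.
$page =~ s{(?<=<code>)(.+?)(?=</code>)}
          {
              my $text = $1;
              $text =~ s/([<>])/ $1 eq '<' ? '&lt;' : '&gt;' /ge;
              $text;
          }gse;

print "$page\n";
```

Only the three tag-looking words inside <code> are escaped. The <li> and <code> tags stay as they are.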
