Question

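We are developing an internal tool that generates documentation for our .NET products.

As part of its functionality, we need to wrap normal paragraphs in <para> tags.

In this context, a "normal paragraph" is a single line of text that may contain some inline XML-like tags but is not inside another block tag such as <cell> or <description>.

An example of a source file: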

Description paragraph #1.
Description paragraph #2.
<code>
Method1();
Method2();
</code>
<list type="number">
  <item>
    <description>
      If you need to do something, use the <see cref="P:foo1" /> method.
    </description>
  </item>
  <item>
    <description> The <see cref="P:foo2" /> method does this.
The <see cref="P:foo3" /> method does that.</description>
  </item>
</list>

<section>
<title>Section title</title>
<content>
Section paragraph #1.
Section paragraph #2.
</content>
</section>

This should be converted to the following:

<para>Description paragraph #1.</para>
<para>Description paragraph #2.</para>
<code>
Method1();
Method2();
</code>
<list type="number">
  <item>
    <description>
      If you need to do something, use the <see cref="P:foo1" /> method.
    </description>
  </item>
  <item>
    <description> The <see cref="P:foo2" /> method does this.
The <see cref="P:foo3" /> method does that.</description>
  </item>
</list>

<section>
<title>Section title</title>
<content>
<para>Section paragraph #1.</para>
<para>Section paragraph #2.</para>
</content>
</section>

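Formally, the task is: wrap each line of text with <para>...</para>, but only if it is not inside one of a limited list of other tags. Each future paragraph inside a tag may carry surrounding whitespace such as CR/LF, tabs, or space characters.

Obviously, regular expressions should be used, but we have not managed to build one that handles this case. Any ideas or hints?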

Answer 1

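You say "obviously, regular expressions should be used". A lot of people would say you left a "not" out of that assertion. See this well known answer.

If you are sure the outer tags are never nested, you might be able to hack together some horrible regex like: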

(<list([^<]|<(?!/list))+</list>)|(<code([^<]|<(?!/code))+</code>)|([^\n]+)

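and replace only the matches of the non-tag part (the last alternative). But really, why not use one of the many XML parsers and simply replace the appropriate text nodes?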
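
For illustration, here is a minimal Python sketch of that regex route, not a definitive implementation. It keeps the shape of the pattern above (with non-capturing groups) and wraps only what the last alternative matches. The names wrap_paragraphs and STRUCTURAL_LINE are hypothetical. Leaving <section>, <content> and <title> lines alone is an assumption taken from the sample in the question, because the pattern by itself would wrap those lines too. Line endings are assumed to be "\n".

```python
import re

# Lines that are only structural markup and must not be wrapped.
# ASSUMPTION: this tag list comes from the question's sample, not from a spec.
STRUCTURAL_LINE = re.compile(r"\s*(</?(section|content)>|<title>.*</title>)\s*$")

# Same idea as the pattern above: a whole <list> or <code> block, or one line.
PATTERN = re.compile(
    r"(?P<block><list(?:[^<]|<(?!/list))+</list>"
    r"|<code(?:[^<]|<(?!/code))+</code>)"
    r"|(?P<line>[^\n]+)"
)


def wrap_paragraphs(text):
    """Wrap top-level text lines in <para>, leaving block markup untouched."""

    def replace(match):
        line = match.group("line")
        if line is None:
            # A <list> or <code> block: copy it through unchanged.
            return match.group(0)
        if not line.strip() or STRUCTURAL_LINE.match(line):
            # Whitespace-only line or structural tag line: keep as is.
            return line
        return "<para>" + line.strip() + "</para>"

    return PATTERN.sub(replace, text)
```

Calling wrap_paragraphs on the source file from the question should give the expected output shown there, as long as <list> and <code> blocks are never nested inside themselves. Anything beyond that (new block tags, nesting) means editing the pattern, which is the maintenance cost this answer warns about.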

Answer 2

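It is hard to infer the full requirements from your example, but if the example is typical, the following XSLT 2.0 stylesheet will do the job, once the supplied content has been wrapped in a <wrapper> element to make it well-formed: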

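<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Child elements of the wrapper (code, list, section, ...) are copied unchanged. -->
<xsl:template match="/wrapper/*">
  <xsl:copy-of select="."/>
</xsl:template>

<!-- Each non-blank line of top-level text becomes its own <para>. -->
<xsl:template match="/wrapper/text()">
  <xsl:for-each select="tokenize(., '\n')[normalize-space()]">
    <para><xsl:copy-of select="."/></para>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>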
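
The same wrap-then-transform idea can also be sketched with a general XML parser. The Python below uses the standard library's xml.dom.minidom and is only a sketch under stated assumptions: PARAGRAPH_HOSTS and INLINE_TAGS are hypothetical tag lists guessed from the question's sample, and <content> is treated as a paragraph host in addition to the top level, which the stylesheet above does not do. The input must be well-formed once wrapped.

```python
from xml.dom import minidom

# ASSUMPTION: elements whose direct text lines become paragraphs.
PARAGRAPH_HOSTS = {"wrapper", "content"}
# ASSUMPTION: inline elements that stay inside the paragraph they appear in.
INLINE_TAGS = {"see"}


def wrap_paragraphs(source):
    """Wrap text lines in <para> using a DOM instead of a regex."""
    doc = minidom.parseString("<wrapper>" + source + "</wrapper>")
    _visit(doc, doc.documentElement)
    # Serialize the children only, so the temporary wrapper disappears again.
    return "".join(child.toxml() for child in doc.documentElement.childNodes)


def _visit(doc, element):
    if element.tagName in PARAGRAPH_HOSTS:
        _wrap_lines(doc, element)
    else:
        # Not a host itself: look for hosts further down (e.g. section/content).
        for child in element.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                _visit(doc, child)


def _wrap_lines(doc, host):
    children = list(host.childNodes)
    for child in children:
        host.removeChild(child)
    para = None  # the <para> being filled, or None between paragraphs

    def open_para():
        nonlocal para
        if para is None:
            para = host.appendChild(doc.createElement("para"))
        return para

    for child in children:
        if child.nodeType == child.TEXT_NODE:
            for index, piece in enumerate(child.data.split("\n")):
                if index > 0:
                    # A newline ends the current paragraph and is kept as is.
                    para = None
                    host.appendChild(doc.createTextNode("\n"))
                if piece.strip():
                    open_para().appendChild(doc.createTextNode(piece))
                elif piece:
                    # Whitespace only: keep it where it was.
                    target = para if para is not None else host
                    target.appendChild(doc.createTextNode(piece))
        elif child.nodeType == child.ELEMENT_NODE and child.tagName in INLINE_TAGS:
            open_para().appendChild(child)
        else:
            # Block element (or comment etc.): ends the paragraph, copied as is.
            para = None
            host.appendChild(child)
            if child.nodeType == child.ELEMENT_NODE:
                _visit(doc, child)
```

Because the parser re-serializes the markup, details such as the space in `<see ... />` are normalized (minidom writes `<see .../>`), so the output is equivalent XML rather than a byte-for-byte copy of the untouched parts.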

Wrapping parts of text with XML tags using regular expressions
