Question

I need to post-process some poorly structured HTML, for example:

<html>
<body>...</body>
<body>...</body>
</html>

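What is the best way to transform this HTML so that the contents of the second body appear inside the first body, minus the extra body tags of course? I don't want this rule to touch anything else.

I thought about matching on the html tag and processing from there with explicit apply-templates calls, but that seems a bit sloppy to me. I know how to match the bogus bodies ("body[position() > 1]"), but I would like to understand how best to write the transform.

Edit: I do need to apply other templates to the children of all of these elements, so a simple copy will not work.

I want to preserve comments and processing instructions. I expect the whole document to be almost an identity transform, except for these multiple bodies and a few other small edits that I have already implemented successfully.

Edit 2: It is important to keep the children of the second body element in the example above. They should become children of the first body tag in the output, placed after the first body tag's own child nodes.

Edit 3: Here is some illustrative input/output (not checked for validity):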

<html>
  <!-- Look at my comments -->
  <head>
    <title>My title!</title>
    <!-- Commentary -->
  </head>
  <body>
     <p>Something <b>bold</b></p>
  </body>
  <body>
     <!-- heh -->
     <p>Some bozo put my parent in here.</p>
  </body>
  <body>
     <p>More stuff here</p>
  </body>
</html>

Desired output:

<html>
  <!-- Look at my comments -->
  <head>
    <title>My title!</title>
    <!-- Commentary -->
  </head>
  <body>
     <p>Something <b>bold</b></p>
     <!-- heh -->
     <p>Some bozo put my parent in here.</p>
     <p>More stuff here</p>
  </body>
</html>

Answer 1

Add these templates to an identity transform:

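<xsl:template match="/html/body[1]">
   <xsl:copy>
      <!-- attributes first: they cannot be added after child nodes -->
      <xsl:apply-templates select="@* | /html/body[2]/@*"/>
      <!-- then the first body's children, followed by the second body's -->
      <xsl:apply-templates select="node() | /html/body[2]/node()"/>
   </xsl:copy>
</xsl:template>

<!-- drop the extra body elements; the predicate keeps this pattern from
     also matching body[1], which would be a same-priority conflict -->
<xsl:template match="/html/body[position() != 1]"/>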

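Edit

To take a belt-and-suspenders approach, you can use body[position() != 1] instead of body[2] in the select expressions above. That handles input with more than two body elements.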

Answer 2

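Working around problems downstream with tailor-made hacks generally leads to a poorly managed code base.

Ideally you should fix the broken HTML at its source; several body tags suggest a serious misunderstanding somewhere.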

Answer 3

If your input HTML is well-formed XML, this XSLT template will do it:

<xsl:template match="/">
  <body>
    <xsl:copy-of select="//body/node()" />
  </body>
</xsl:template>

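(I did not bother with the <html> node in this example, because that part is trivial.)

A more flexible variant of the above (as requested by the OP):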

<!-- explicitly catching the initial html circumvents built-in templates -->
<xsl:template match="/html">
  <xsl:copy>
    <xsl:apply-templates />
  </xsl:copy>
</xsl:template>

<!-- copy everything that is not processed otherwise -->
<xsl:template match="@*|node()|processing-instruction()">
  <xsl:copy-of select="." />
</xsl:template>

<!-- matches any "body" node, but produces output only for the first -->
<xsl:template match="body">
  <xsl:if test="not(preceding-sibling::body)">
    <xsl:copy>
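      <!-- attributes first: they cannot be added once child nodes exist -->
      <xsl:apply-templates select="//body/@*" />
      <xsl:apply-templates select="//body/node()" />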
    </xsl:copy>
  </xsl:if>
</xsl:template>

<!-- you can add more of these specific templates, as needed -->
<xsl:template match="body//a">
  <b>
    <xsl:copy-of select="." />
  </b>
</xsl:template>

This input:

<html>
  <head><title>Foo!</title></head>
  <?dummy processing instruction?>
  <body foo="bar">...<a href="foo">asd</a><!-- comment --></body>
  <body>...contents of body#2...</body>
</html>

produces this result (whitespace and indentation changed for readability):

<html>
  <head><title>Foo!</title></head>
  <?dummy processing instruction?>
  <body foo="bar">
    ...
    <b><a href="foo">asd</a></b>
    <!-- comment -->
    ...contents of body#2...
  </body>
</html>

Answer 4

Maybe this is closer to what you are after:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            version="2.0" exclude-result-prefixes="xsl">
<xsl:output indent="yes" method="html"/>

<xsl:template match="/">
    <xsl:apply-templates select="@*|node()"/>
</xsl:template>

<!-- Identity Template -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<!-- Matches on the first 'body' tag -->
<xsl:template match="body[1]">
    <xsl:copy>
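        <!-- keep the first body's own attributes (must precede child nodes) -->
        <xsl:apply-templates select="@*"/>
        <!-- apply-templates to the children of all the body tags -->
        <xsl:apply-templates select="//body/node()"/>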
    </xsl:copy>
</xsl:template>

<!-- Skip processing on the subsequent body tags 
     (their children are still processed however)   -->
<xsl:template match="body"/>

</xsl:stylesheet>

This uses the popular "push" style of templates, so you may find it more flexible.

Answer 5

I think @Keltex meant that you should strip

</body>\s*<body>

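before processing the document, so that you can write your XSLT as if the input were normalized.

That is what I would do.

(This assumes there is no content between the multiple body tags.)

Edit: This does not remove the contents of the body tags. Note that you are removing everything from a closing body tag through the next opening body tag. That leaves the initial opening tag and the final closing tag in place. In other words, given input like this: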

<body>
    good stuff
</body>
<body>
    more good stuff
</body>

you would target the two tags in the middle. Removing them yields one continuous body:

<body>
    good stuff
    more good stuff
</body>
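As a rough illustration of that pre-processing step, here is a minimal Python sketch using the standard `re` module. The helper name `merge_bodies` is hypothetical, and allowing attributes on the opening tag (`[^>]*`) is an assumption that goes slightly beyond the plain `</body>\s*<body>` pattern above.

```python
import re

# Pattern based on `</body>\s*<body>`; the `[^>]*` part (attributes on the
# opening tag) and case-insensitivity are assumptions added for robustness.
BODY_SEAM = re.compile(r"</body\s*>\s*<body\b[^>]*>", re.IGNORECASE)


def merge_bodies(html: str) -> str:
    """Remove each closing/opening body tag pair separated only by whitespace."""
    return BODY_SEAM.sub("", html)


sample = "<html><body><p>good stuff</p></body>\n<body><p>more good stuff</p></body></html>"
merged = merge_bodies(sample)
print(merged)
```

Note that any attributes on the removed opening tags are discarded, and body tags with other content between them are left alone, in line with the assumption stated above.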

Answer 6

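If the HTML is that messed up, I would be reluctant to assume it is well formed enough to use XSLT. You might just want to use a regular expression to find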

<body>(whitespace)</body>

and remove it.
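One way to find out whether the input is well formed enough for XSLT in the first place is simply to try parsing it as XML. This is only a sketch using Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Multiple body elements are still well-formed XML, so XSLT remains an option.
print(is_well_formed("<html><body><p>ok</p></body><body/></html>"))
# An unclosed <br> is typical HTML but not well-formed XML.
print(is_well_formed("<html><body><p>unclosed<br></p></body></html>"))
```

If the check fails, a regex clean-up or an HTML-tolerant parser is probably the safer route before any XSLT runs.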

XSLT: help me fix multiple BODY tags
