Question

I have a large number of HTML files created by MS Word. I am trying to manipulate the contents of these files to extract data and so on.

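The HTML paragraphs have mixed content, and I've found that the space following italic or bold text is usually italic or bold too. When I apply normalize-space(), that space is stripped, and words that should stay separate get joined together.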

<p>Some text here and some <i>italicized </i>text here.</p>

A later transformation turns this into

<p>Some text here and some <i>italicized</i>text here.</p>

(I'm simplifying things somewhat.)

I'd like to end up with

<p>Some text here and some <i>italicized</i> text here.</p>

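I want to identify the case where the last node in an element is a text node that ends with a space, strip the trailing space, and add a space after the element.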

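I think I could cobble something together, but the XQuery is getting hairy, and I have to think there's an easier way. (Maybe there isn't, but I'd be foolish not to ask.)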

"XSLT, finding out if last child node is a specific element" looks close, but isn't quite there.

Answer 1

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

  <xsl:template match="@*|node()">
      <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
  </xsl:template>

  <!--Match the elements whose last child node is a text() node 
      that ends with a space. -->
  <xsl:template match="*[node()[last()]
                               [self::text()[substring(.,string-length())=' ']]]">
      <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
      <!--add the extra space following the matched element-->
      <xsl:text> </xsl:text>
  </xsl:template>

  <!--Match the text() node that is the last child node of an element 
      and ends with a space -->
  <xsl:template match="*/node()[last()]
                               [self::text()[substring(., string-length())=' ']]">
      <!--remove the trailing space: substring() starting at position 0
          with length n returns the first n-1 characters-->
      <xsl:value-of select="substring(., 0, string-length())"/>
  </xsl:template>

</xsl:stylesheet>
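
The stylesheet above only handles a single trailing space character. Word output may also leave tabs, line breaks, or runs of several spaces before the closing tag. As a rough sketch, and assuming an XSLT 2.0 processor is available (the answer above targets 1.0, so this is a variation rather than the answerer's own method), the same idea can be written with the standard `matches()` and `replace()` functions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Identity template: copy every node and attribute unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- An element whose last child node is a text node ending in whitespace:
       copy it as usual, then emit one space after its closing tag. -->
  <xsl:template match="*[node()[last()][self::text()][matches(., '\s$')]]">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
    <xsl:text> </xsl:text>
  </xsl:template>

  <!-- A text node that is the last child of its parent and ends in
       whitespace: output it with all trailing whitespace removed. -->
  <xsl:template match="text()[not(following-sibling::node())][matches(., '\s$')]">
    <xsl:value-of select="replace(., '\s+$', '')"/>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the sample paragraph from the question, this should move the space out of the `<i>` element in the same way as the 1.0 version. One caveat: because `\s` also matches line breaks, on indented markup the second template will fire for block elements whose closing tag sits on its own line. If that is unwanted, narrowing the match from `*` to the inline elements of interest (for example `i | b`) is probably the simplest adjustment.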

Move trailing space from a mixed-content node to the outside node
