Question

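I have been given a piece of messy HTML, which I cleaned up with HTML Tidy. I want to convert it into a flavor of DITA.

I want to get the first element that has text in it and turn it into the chapter title.

I have a file (simplified):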

<html><head></head>
<body>
<p><img src="i.gif" alt="int.gif (792 bytes)" border="0" width="105" height="18" /> 
    <strong>
       <a class="c1" name="flag" id="flag">Flags</a>
     </strong>
 </p>
<!-- the elements between the first p and the actual text may vary. -->
<!--more -->

Or sometimes it looks like this:

  <html><head></head>
    <body>
    <table border="0" cellpadding="3" cellspacing="0" width="100%">
    <tbody> <!-- sometimes this is missing !! -->
    <tr>
    <td class="c3" width="100%">
        <span class="c2">
            <a class="c1" name="Errors" id="errors">Error-Codes</a>   <strong>with troubleshooting</strong>
        </span>
     </td></tr></tbody></table>  <!--more --></body></html>

Or it might be something else.

I have tried these:

<xsl:template match="body">
    <xsl:element name="chapter">
        <xsl:element name="title">
            <!-- <xsl:value-of select="table[1]//td[1]"/> first td, but not p -->
            <!-- <xsl:value-of select="./p[1]//text()"/> first para -->
            <!-- <xsl:value-of select="table[1]//td()[1] or p[1]"/> invalid syntax -->
            <!-- <xsl:value-of select="text()[1]"/>  nothing -->
            <!-- <xsl:value-of select="//text()[1]"/> gets all text in document -->
        </xsl:element>
    </xsl:element>
</xsl:template>

I also tried:

<!--  <xsl:value-of select=".//*[@class='c1'][1]"/> gets the first child with class="c1" in every subnode, of which there are often many -->

By popular request ;-) this is what I want:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE chapter SYSTEM "our.dtd">
<chapter template-version="01">
    <title>Flags</title>
<!-- blabbity blab -->
</chapter>

Or

<chapter template-version="01">
    <title>Error codes with troubleshooting</title>
      <!-- I would also accept just "Error codes", 
           I could leave some billable work for later -->
<!-- blabbity blab -->
</chapter>

Answer 1

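"I want to get the first element that has text in it and turn it into the chapter title."

That is not as easy as it sounds. What is "the first element that has text in it", anyway?

In your first example, it would be: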

<a class="c1" name="flag" id="flag">Flags</a>

Easy enough. By the same logic, in your second example it would be:

<a class="c1" name="Errors" id="errors">Error-Codes</a>

But of course it is not that easy, because what you really want is this:

<span class="c2">
    <a class="c1" name="Errors" id="errors">Error-Codes</a>   <strong>with troubleshooting</strong>
</span>

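So what is the defining characteristic of the element you want to use as the title?

I will make an educated guess and define it as:

The first non-inline element that contains no other non-inline elements and that contains non-empty text.

"Non-inline" means all block-level elements plus things like <td>, which differ technically from block-level elements in ways that do not matter here.

Applying this definition to the first example gives us: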

<p><img src="i.gif" alt="int.gif (792 bytes)" border="0" width="105" height="18" /> 
    <strong>
       <a class="c1" name="flag" id="flag">Flags</a>
     </strong>
</p>

whose text value is still "Flags".

In your second example, the element we end up with is:

<td class="c3" width="100%">
    <span class="c2">
        <a class="c1" name="Errors" id="errors">Error-Codes</a>   <strong>with troubleshooting</strong>
    </span>
</td>

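whose text value is "Error-Codes with troubleshooting".

So the definition seems to work for the examples you provided.

An XPath that matches all relevant "non-inline" elements could look like this: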

//*[self::p|self::td|self::div|self::and-so-on]

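Add more container element types as needed; "and-so-on" is only a placeholder for them.

When we add the condition that the element must not contain other elements of the same kind, we end up with: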

//*[self::p|self::td|self::div|self::and-so-on][
    not(.//*[self::p|self::td|self::div|self::and-so-on])
]

Add the condition that it must contain some text:

//*[self::p|self::td|self::div|self::and-so-on][
    not(.//*[self::p|self::td|self::div|self::and-so-on])
    and normalize-space() != ''
]

...and of all the elements in the document that satisfy these conditions, we want only the first one:

(//*[self::p|self::td|self::div|self::and-so-on][
    not(.//*[self::p|self::td|self::div|self::and-so-on])
    and normalize-space() != ''
])[1]

And of that first element, we want the normalized text value:

normalize-space(
    (//*[self::p|self::td|self::div|self::and-so-on][
        not(.//*[self::p|self::td|self::div|self::and-so-on])
        and normalize-space() != ''
    ])[1]
)

All of this in XSLT:

<xsl:template match="body">
  <title>
    <xsl:value-of select="
        normalize-space(
            (//*[self::p|self::td|self::div|self::and-so-on][
                not(.//*[self::p|self::td|self::div|self::and-so-on])
                and normalize-space() != ''
            ])[1]
        )
    " />
  </title>
</xsl:template>
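
For reference, here is how the template might sit inside a complete XSLT 1.0 stylesheet that also produces the chapter wrapper and DOCTYPE from the desired output. This is only a sketch under a few assumptions: the input is complete, well-formed markup with its elements in no namespace (as in the samples), the container list is cut down to p, td and div, and the root template that skips the head is my own addition rather than something the question asked for.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- doctype-system writes the DOCTYPE line; "our.dtd" is taken from the question -->
  <xsl:output method="xml" encoding="UTF-8" indent="yes"
      doctype-system="our.dtd"/>

  <!-- Assumption: only the body is of interest, so the head is never processed -->
  <xsl:template match="/">
    <xsl:apply-templates select="html/body"/>
  </xsl:template>

  <xsl:template match="body">
    <chapter template-version="01">
      <title>
        <!-- First container element (in document order) that has no nested
             container and whose whitespace-normalized text is not empty -->
        <xsl:value-of select="
            normalize-space(
                (//*[self::p|self::td|self::div][
                    not(.//*[self::p|self::td|self::div])
                    and normalize-space() != ''
                ])[1]
            )
        "/>
      </title>
      <!-- the conversion of the remaining body content would go here -->
    </chapter>
  </xsl:template>

</xsl:stylesheet>
```

With a command-line processor such as xsltproc this could be run as `xsltproc chapter.xsl input.xml`, where both file names are hypothetical. If HTML Tidy was configured to emit XHTML in the XHTML namespace, the element names in the expressions would need a namespace prefix bound to that namespace, otherwise nothing matches.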
