Question

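I need help doing a few things with XPath in PHP.

For any given HTML, I need to:

Remove all tables and their contents
Remove everything after the first h1 tag
Keep only paragraphs (including their inner HTML (links, lists, etc.))

I had everything working well with regular expressions. But once I ran into nested tables, I realized that parsing HTML with regular expressions really is foolish.

Thanks a lot!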

Answer 1

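For any given HTML, I need to:

• Remove all tables and their contents

• Remove everything after the first h1 tag

• Keep only paragraphs (including their inner HTML (links, lists, etc.))

This can be done very easily with XSLT: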

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:h="http://www.w3.org/1999/xhtml" >
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- Copy every node except when overridden
      by another template -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

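 <!-- Remove all tables and their contents.
      The explicit priority keeps this rule from losing to
      the predicate-based rules (default priority 0.5) -->
 <xsl:template match="h:table" priority="2"/>

 <!-- Remove everything after the first h1 -->
 <xsl:template match="node()[preceding::h:h1]" priority="1"/>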

 <!-- Keep only paragraphs (INCLUDING
      their inner HTML (links, lists, etc))
  -->
 <xsl:template match=
 "node()[not(self::h:p) and not(ancestor::h:p)]">
  <xsl:apply-templates/>
 </xsl:template>
</xsl:stylesheet>

If your element names are not in the XHTML namespace, just remove every occurrence of h: from the code above.
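
Since the question is about PHP, here is a rough sketch of how a stylesheet like this might be run from PHP with the bundled XSLTProcessor class (this needs the xsl extension). It assumes the HTML is parsed with DOMDocument::loadHTML, which produces elements without a namespace, so the copy of the stylesheet below has the h: prefixes removed. The function name filterWithXslt and the explicit priority attributes are my own assumptions, added so that the table rule reliably wins over the predicate-based rules.

```php
<?php
// Sketch only: filterWithXslt() is a hypothetical helper, not part of the answer.
function filterWithXslt(string $html): string
{
    // Namespace-free copy of the stylesheet; priorities are an added assumption.
    $xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="table" priority="3"/>
 <xsl:template match="node()[preceding::h1]" priority="2"/>
 <xsl:template match="node()[not(self::p) and not(ancestor::p)]" priority="1">
  <xsl:apply-templates/>
 </xsl:template>
</xsl:stylesheet>
XSL;

    $style = new DOMDocument();
    $style->loadXML($xsl);

    // Parse lenient HTML; collect parser warnings instead of printing them.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD); // do not add a default doctype
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($style);
    $result = $processor->transformToXml($doc);

    // transformToXml() does not return a string on failure.
    return is_string($result) ? $result : '';
}

// Usage: only the paragraph before the h1 (and outside the table) should survive.
echo filterWithXslt(
    '<p>Intro <a href="#">link</a></p>'
    . '<table><tr><td><p>cell</p></td></tr></table>'
    . '<h1>Title</h1><p>After</p>'
);
```

Note that xsl:strip-space also drops whitespace-only text nodes inside paragraphs, so a lone space between two inline elements would be lost.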

Answer 2

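Consider using an HTML DOM parser, as that will be easier than treating the input as XML. Some of these parsers also support XPath expressions. The tricky part is that not all HTML conforms to strict XHTML, so the rules are not always easy to apply. Here are a few DOM parsers I have come across. Some support XPath; others just offer different ways of selecting content: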

http://simplehtmldom.sourceforge.net/

http://php.net/manual/en/simplexmlelement.xpath.php
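
As an illustration of the DOM-parser route, here is a minimal sketch that uses PHP's bundled DOMDocument and DOMXPath classes rather than either of the libraries linked above. DOMDocument::loadHTML tolerates HTML that is not well-formed XML, which addresses the strictness problem. The helper name keepParagraphs is hypothetical, and the XPath expression is just one possible reading of the three requirements in the question.

```php
<?php
// Sketch only: keepParagraphs() is a hypothetical helper name.
function keepParagraphs(string $html): string
{
    // Parse lenient HTML; collect parser warnings instead of printing them.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // 1. Remove all tables (nested ones included) and their contents.
    foreach (iterator_to_array($xpath->query('//table')) as $table) {
        if ($table->parentNode !== null) {
            $table->parentNode->removeChild($table);
        }
    }

    // 2 + 3. Keep only outermost paragraphs that do not come after an h1,
    // serialising each one together with its inner HTML.
    $kept = [];
    foreach ($xpath->query('//p[not(ancestor::p) and not(preceding::h1)]') as $p) {
        $kept[] = $doc->saveHTML($p);
    }

    return implode("\n", $kept);
}

// Usage: only the paragraph before the h1 (and outside the table) should survive.
echo keepParagraphs(
    '<p>Intro <a href="#">link</a></p>'
    . '<table><tr><td><p>cell</p></td></tr></table>'
    . '<h1>Title</h1><p>After</p>'
);
```

One caveat: without an encoding hint in the markup, loadHTML may not treat the input as UTF-8, so non-ASCII text might need extra handling.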

Help with PHP and XPath