Extracting a table from HTML using XSLTProcessor

Time: 2013-04-16 23:46:21

Tags: php xslt

I am trying to extract, as XML, the contents of the table with the class sticky-enabled.

My PHP code is:

<?php

// Load the XML source
$xml = new DOMDocument;
$out = $xml->load("collection.html");

$xsl = new DOMDocument;
$xsl->load('collection.xsl');

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules

$xml = $proc->transformToXML($xml);

$xml = simplexml_load_string($xml);

print_r($xml);

?>

The HTML in collection.html is:

<table>
    <thead>
        <tr>
            <th>A</th>
        </tr>
        <tbody>
        <tr>
            <td>B</td>
        </tr>
        </tbody>
    </thead>
</table>

<table class="sticky-enabled">
 <thead><tr><th>Date</th><th>Time</th><th>Location</th><th>Tracking Event</th> </tr></thead>
<tbody>
 <tr class="odd"><td>16-04-2013</td><td>19:20</td><td>International Hub</td><td>Forwarded for export</td> </tr>
 <tr class="even"><td>16-04-2013</td><td>18:53</td><td>International Hub</td><td>Received and processed</td> </tr>
 <tr class="odd"><td>15-04-2013</td><td>17:28</td><td>Manchester Piccadilly Depot</td><td>Collected from customer</td> </tr>
 <tr class="even"><td>15-04-2013</td><td>00:00</td><td>WDM Online</td><td></td> </tr>
</tbody>
</table>

<table>
    <thead>
        <tr>
            <th>A</th>
        </tr>
        <tbody>
        <tr>
            <td>B</td>
        </tr>
        </tbody>
    </thead>
</table>

Finally, collection.xsl is:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
  <output>
    <xsl:for-each select="table[@class='sticky-enabled']/tbody/tr">
      <tracking>
        <date><xsl:value-of select="td[1]" /></date>
        <time><xsl:value-of select="td[2]" /></time>
        <event><xsl:value-of select="td[3]" /></event>
        <extra><xsl:value-of select="td[4]" /></extra>        
      </tracking> 
    </xsl:for-each>
  </output>    
  </xsl:template>
</xsl:stylesheet>

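If I run this, $xml is empty. If I edit collection.html and remove the first and last tables (i.e. leave only the one I am trying to access), it works. I suspect the problem is this line: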

<xsl:for-each select="table[@class='sticky-enabled']/tbody/tr">

1 Answer:

Answer 0: (score: 0)

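Your "XML" is not well-formed, so it cannot be parsed and transformed with XSLT. An XML document must have exactly one document element, but you have three sibling <table> elements at the top level. Removing the other two tables happens to produce a well-formed XML file, which is why the transformation works in that case.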
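The parse failure is easy to miss because the question's code ignores the return value of load(). A minimal sketch for surfacing it, assuming a shortened stand-in fragment (the $fragment string below is hypothetical, not the real collection.html):

```php
<?php
// Hypothetical stand-in for collection.html: two sibling top-level elements,
// which is not well-formed XML.
$fragment = '<table><tr><td>A</td></tr></table>'
          . '<table><tr><td>B</td></tr></table>';

// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$loaded = $doc->loadXML($fragment); // false when the input cannot be parsed

$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    // Each entry is a LibXMLError with a message, line and column.
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}

var_dump($loaded);
```

With more than one top-level element the parser should report an error about content after the first element, and $loaded should be false. Checking that value before calling transformToXML() avoids transforming an empty document.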

Try wrapping the tables in a single XML element.

For example:

<doc>
  <table>
    <thead>
        <tr>
            <th>A</th>
        </tr>
        <tbody>
        <tr>
            <td>B</td>
        </tr>
        </tbody>
    </thead>
</table>

<table class="sticky-enabled">
 <thead><tr><th>Date</th><th>Time</th><th>Location</th><th>Tracking Event</th> </tr></thead>
<tbody>
 <tr class="odd"><td>16-04-2013</td><td>19:20</td><td>International Hub</td><td>Forwarded for export</td> </tr>
 <tr class="even"><td>16-04-2013</td><td>18:53</td><td>International Hub</td><td>Received and processed</td> </tr>
 <tr class="odd"><td>15-04-2013</td><td>17:28</td><td>Manchester Piccadilly Depot</td><td>Collected from customer</td> </tr>
 <tr class="even"><td>15-04-2013</td><td>00:00</td><td>WDM Online</td><td></td> </tr>
</tbody>
</table>

<table>
    <thead>
        <tr>
            <th>A</th>
        </tr>
        <tbody>
        <tr>
            <td>B</td>
        </tr>
        </tbody>
    </thead>
  </table>
</doc>

Then adjust the stylesheet to account for the change in structure, matching the document element instead of the root node:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:template match="doc">
        <output>
            <xsl:for-each select="table[@class='sticky-enabled']/tbody/tr">
                <tracking>
                    <date><xsl:value-of select="td[1]" /></date>
                    <time><xsl:value-of select="td[2]" /></time>
                    <event><xsl:value-of select="td[3]" /></event>
                    <extra><xsl:value-of select="td[4]" /></extra>        
                </tracking> 
            </xsl:for-each>
        </output>    
    </xsl:template>
</xsl:stylesheet>
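If editing collection.html by hand is not an option, the wrapping can also be done in PHP before parsing. The following is only a sketch under some assumptions: the fragment is abbreviated and inlined as a string (in practice it might come from file_get_contents('collection.html')), it contains no XML declaration or doctype, and the wrapper name doc is arbitrary as long as the stylesheet matches it. It also requires PHP's xsl extension.

```php
<?php
// Abbreviated stand-in for collection.html: sibling <table> elements, no single root.
$fragment = <<<'HTML'
<table><tbody><tr><td>B</td></tr></tbody></table>
<table class="sticky-enabled">
<tbody>
 <tr class="odd"><td>16-04-2013</td><td>19:20</td><td>International Hub</td><td>Forwarded for export</td></tr>
 <tr class="even"><td>15-04-2013</td><td>00:00</td><td>WDM Online</td><td></td></tr>
</tbody>
</table>
HTML;

// Same rules as the adjusted stylesheet, with the template matching the <doc> wrapper.
$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="doc">
    <output>
      <xsl:for-each select="table[@class='sticky-enabled']/tbody/tr">
        <tracking>
          <date><xsl:value-of select="td[1]"/></date>
          <time><xsl:value-of select="td[2]"/></time>
          <event><xsl:value-of select="td[3]"/></event>
          <extra><xsl:value-of select="td[4]"/></extra>
        </tracking>
      </xsl:for-each>
    </output>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Wrap the fragment in one document element so that it is well-formed XML.
$xml = new DOMDocument;
if (!$xml->loadXML('<doc>' . $fragment . '</doc>')) {
    throw new RuntimeException('Input is still not well-formed XML');
}

$xsl = new DOMDocument;
$xsl->loadXML($stylesheet);

$proc = new XSLTProcessor;
$proc->importStylesheet($xsl);

// Parse the transformed XML and print one line per tracking row.
$result = simplexml_load_string($proc->transformToXML($xml));
foreach ($result->tracking as $row) {
    echo $row->date, ' ', $row->time, ' ', $row->event, "\n";
}
```

This only works while the markup itself is valid XML (closed tags, quoted attributes). For real-world HTML that is not, DOMDocument::loadHTML() is the more forgiving parser, though the XPath expressions would then need to account for the html and body elements it adds.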