Question

I'm stuck down a rabbit hole trying to parse an HTML file.

The basics:

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile('myfile.html');
$xp = new DOMXPath($dom);

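After initializing, my technique has been to use XPath queries to get the variables I need.

I really have no problem when there is one specific item or node; that is easy to pinpoint and retrieve.

The HTML I'm loading is basically generated in a loop. Trimmed down, it looks like this: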

<div class="intro">
    <div class="desc-wrap">
        Text Text Text
    </div>
    <div class="main-wrap">
        <table class="table-wrap">
            <tbody>
                <tr>
                    <th class="range">Range </th>
                    <th>#1</th>
                    <th>#2</th>
                </tr>
            </tbody>
        </table>
    </div>
</div>
<div class="intro">
    <div class="desc-wrap">
        Text Text Text
    </div>
    <div class="main-wrap">
        <table class="table-wrap">
            <tbody>
                <tr>
                    <th class="range">Range </th>
                    <th>#1</th>
                    <th>#2</th>
                    <th>#3</th>
                    <th>#4</th>
                </tr>
            </tbody>
        </table>
    </div>
</div>

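This goes on 100 times (meaning 100 instances of <div class="intro"> . . . </div>).

So, I'm trying to get the contents of desc-wrap (no problem there), the text nodes, and a count of how many <th> elements are in each table.

I query the div, figuring one XPath query is probably better than two.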

$intropath = $xp->query("//div[@class='intro']");

Loop through it.

$f=1;
foreach ($intropath as $sp) {
echo $f++ . '<br />'; // Makes it way to 100, good.

The core problem I'm having is trying to count the number of <th> elements in each table.

$gettables = $xp->query("//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')]//th", $sp);
var_dump($gettables); // public 'length' => int 488
// Okay, so this is getting all the <th> elements in the 
// entire document, not just in the loop. Maybe not what I want.

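Here are the other things I've tried (and by that I mean failed at).

Okay, let's try targeting only the first table (adding [0] before //th) and see if I can get something.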

$gettables = $xp->query("//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')][0]//th", $sp);

Nope. Non-object. Length of 0. Not sure why. Okay, let's move on.

Maybe try this?

//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')]//th[count(following-sibling::*)]

Okay. So length = 100. It must be getting one th and extrapolating. Not what I want.

Maybe just

//th[count(*)]

Nope. Non-object.

Maybe this?

count(//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')]//th)

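Nope. More non-object.

Those are examples of what I've tried. Failing (well, learning) is fun, but what am I missing? For my output, I just want to find out how many <th> elements are in each table.

So, something like: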

foreach ($intropath as $sp) {
$xpath = $xp->query("//actual/working/xpath/for/individual/th");
$thcount = count($getsizes->item(0)); // or something?
echo $thcount . '<br>';

With the example above, this would output

3

5

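and of course continue on for the other 98 iterations.

This is probably something stupid. I've been referencing this cheatsheet and this cheatsheet, and I've learned a lot about what XPath can do, but this answer is eluding me. At this point, I'm not even sure whether foreach ($intropath as $sp) { is the right way to accomplish what I'm doing.

Anyone want to dig me out of this hole so I can move on to the next step and/or my life?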

Answer 1

Use iterated query() calls to count the qualifying nodes.

Code: (Demo)

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
foreach ($xp->query("//div[contains(@class,'main-wrap')]/table[contains(@class, 'table-wrap')]//tr") as $node) {
    echo $xp->query("th", $node)->length , "\n";
}

Output:

3
5
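
The question also asks for the desc-wrap text alongside each count. Here is a minimal sketch of one way to get both in a single loop, assuming markup shaped like the sample in the question (the $html string and the $rows array are made up for illustration). The detail that matters is that a path starting with // searches from the document root even when a context node is passed, while a bare step or a path starting with .// stays inside the context node.

```php
<?php
// Hypothetical sample input, shortened from the question's markup.
$html = <<<'HTML'
<div class="intro">
    <div class="desc-wrap">First text</div>
    <div class="main-wrap">
        <table class="table-wrap">
            <tr><th class="range">Range</th><th>#1</th><th>#2</th></tr>
        </table>
    </div>
</div>
<div class="intro">
    <div class="desc-wrap">Second text</div>
    <div class="main-wrap">
        <table class="table-wrap">
            <tr><th class="range">Range</th><th>#1</th><th>#2</th><th>#3</th><th>#4</th></tr>
        </table>
    </div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$rows = []; // illustrative result container, not part of the original code
foreach ($xp->query("//div[@class='intro']") as $intro) {
    // Relative paths (no leading //) are evaluated inside $intro only.
    $desc = trim($xp->query("div[@class='desc-wrap']", $intro)->item(0)->textContent);
    $ths  = $xp->query(".//table[contains(@class, 'table-wrap')]//th", $intro);
    $rows[] = [$desc, $ths->length];
    echo $desc, ': ', $ths->length, "\n";
}
```

With this sample input, the loop reports 3 for the first block and 5 for the second.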

Answer 2

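First, query the table elements:

$intropath = $xp->query("//table[contains(@class, 'table-wrap')]");

Then get the th count for each table with a second XPath query, evaluated relative to the context node, and read the length of the node list it returns:

foreach ($intropath as $tab) {
  $count = $xp->query(".//th", $tab)->length;
  echo $count . "<br>";
}

That should be all.

PS:
PHP's query() method does not return a usable result for the XPath count() function, so the counting is done on the PHP side instead.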
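
For what it's worth, DOMXPath also has an evaluate() method which, unlike query(), can return the typed result of an expression such as count(). A minimal sketch, assuming a cut-down version of the question's markup (the $html string and the $counts array are made up for illustration):

```php
<?php
// Hypothetical sample input: two tables with 3 and 5 header cells.
$html = '<table class="table-wrap"><tr><th>Range</th><th>#1</th><th>#2</th></tr></table>'
      . '<table class="table-wrap"><tr><th>Range</th><th>#1</th><th>#2</th><th>#3</th><th>#4</th></tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$counts = [];
foreach ($xp->query("//table[contains(@class, 'table-wrap')]") as $tab) {
    // evaluate() returns a float for count(); cast it to int for display.
    $counts[] = (int) $xp->evaluate("count(.//th)", $tab);
}
echo implode("\n", $counts), "\n";
```

The leading dot in .//th is what keeps the count scoped to the current table.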

Just for completeness:
If XPath 2.0 is available, the following expression is more compact:

string-join(//table[contains(@class, 'table-wrap')]/count(.//th),'#')

Here, # is the separator between the counts of each table.
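
PHP's DOM extension is built on libxml2, which implements XPath 1.0, so the XPath 2.0 expression will not run through DOMXPath. A rough equivalent, sketched under the assumption of a cut-down version of the question's markup (the $html string and the $counts and $joined variables are made up for illustration), collects the per-table counts in PHP and joins them with implode():

```php
<?php
// Hypothetical sample input: two tables with 3 and 5 header cells.
$html = '<table class="table-wrap"><tr><th>Range</th><th>#1</th><th>#2</th></tr></table>'
      . '<table class="table-wrap"><tr><th>Range</th><th>#1</th><th>#2</th><th>#3</th><th>#4</th></tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$counts = [];
foreach ($xp->query("//table[contains(@class, 'table-wrap')]") as $tab) {
    // One relative query per table; length is the number of matched th nodes.
    $counts[] = $xp->query(".//th", $tab)->length;
}

// Join the counts with '#', mirroring string-join(..., '#').
$joined = implode('#', $counts);
echo $joined, "\n"; // 3#5
```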
