Question

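About two days ago I received a suggestion to use DOMDocument instead of regular expressions.

I still don't know how to use the query correctly.

The page linked below has a section called "TERRITÓRIO E AMBIENTE", and I want to get the content of the 4 rows below it.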

https://cidades.ibge.gov.br/brasil/sp/sao-paulo/panorama

$html = file_get_contents('https://cidades.ibge.gov.br/brasil/sp/sao-paulo/panorama');
$document = new DOMDocument();
$document->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$domxpath = new DOMXPath($document);
$paragraphs = $domxpath->query('
    //th[*[
            contains(text(), "TERRITÓRIO E AMBIENTE")
          ]
        ]
    /following-sibling::tr[
            position() = 12
        ]'
);

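I used 12 for the <tr> position because that is what I counted in the page source, but I don't know whether I am writing this query correctly. These are the errors I get: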

Warning: DOMDocument::loadHTML(): Tag app invalid in Entity, line: 25 
Warning: DOMDocument::loadHTML(): Misplaced DOCTYPE declaration in Entity, line: 25
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 25

Thanks.

Answer 1

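There are several problems with your code.

First, the HTML you get from that site is not valid, so you need to suppress the parse errors (usually not recommended, but I think it is fine in this case):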

@$document->loadHTML($html);
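
As an alternative to the `@` operator, libxml's internal error buffer can be used, so the parse errors are collected rather than silently discarded. This is only a minimal sketch: `$html` here is a made-up inline stand-in for the downloaded page (with an invalid `<app>` tag like the one named in the warnings), not the real markup.

```php
<?php
// Hypothetical stand-in for the downloaded page; the real markup differs.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body><app><p>hello</p></app></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$loaded = $document->loadHTML($html);

// The collected errors can be inspected or logged, then cleared.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error-handling mode was active before.
libxml_use_internal_errors($previous);

// The document is still usable even though the markup was not valid HTML.
$paragraphText = $document->getElementsByTagName('p')->item(0)->textContent;
echo $paragraphText, PHP_EOL;
```

Compared with `@`, this keeps the errors available for logging if the page ever changes in a way that breaks the query.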

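Second, the text you are looking for is not uppercase in the HTML source (it is only displayed in uppercase because of its styling), so you need to either normalize the case or search for the text as it actually appears.

Third, your approach (taking the 12th sibling) is too fragile. I inspected the markup a bit and it is hard to make it much less fragile, but I think this comes close: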

//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[1]/td[3]

This gets the th element containing the text Território e Ambiente, then its parent tr, then moves to the next tr sibling, and finally selects the third td element (the one holding the value). It is still quite fragile, but the page is unlikely to change much; just keep an eye on it.

Now you need to repeat that XPath query three more times, changing the index of the tr sibling (adding two each time, because there is an empty element between each pair). It ends up looking like this:

$document = new DOMDocument();
@$document->loadHTML($html);
$domxpath = new DOMXPath($document);

$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[1]/td[3]');
echo "First: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
echo "<br>";

$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[3]/td[3]');
echo "Second: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
echo "<br>";

$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[5]/td[3]');
echo "Third: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
echo "<br>";

$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[7]/td[3]');
echo "Fourth: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);

This prints:

First: 1.521,110 km²
Second: 92.6%
Third: 74,8%
Fourth: 50,3%

Note the use of preg_replace() to get rid of the abundant whitespace.

With a bit more XPath magic we can do it with a single query:

//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[position() mod 2 = 1]/td[3]

It works like the others, but instead of selecting one specific tr sibling, it selects every other one.
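
The single-query approach can be sketched end to end as follows. The inline `$html` is a hypothetical, simplified stand-in for the page (a header row followed by data rows alternating with empty spacer rows); the real markup is more complex, so treat this as an illustration of the XPath rather than a drop-in scraper.

```php
<?php
// Hypothetical, simplified stand-in for the page markup: a header row,
// then data rows alternating with empty spacer rows.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body><table>'
      . '<tr><th colspan="3">Território e Ambiente</th></tr>'
      . '<tr><td>a</td><td></td><td> 1.521,110   km² </td></tr>'
      . '<tr><td></td></tr>'
      . '<tr><td>b</td><td></td><td>92.6 %</td></tr>'
      . '<tr><td></td></tr>'
      . '<tr><td>c</td><td></td><td>74,8 %</td></tr>'
      . '<tr><td></td></tr>'
      . '<tr><td>d</td><td></td><td>50,3 %</td></tr>'
      . '</table></body></html>';

$document = new DOMDocument();
@$document->loadHTML($html);
$domxpath = new DOMXPath($document);

// Select the third cell of every odd-numbered row after the header row.
$cells = $domxpath->query(
    '//th[contains(text(), "Território e Ambiente")]'
    . '/parent::tr/following-sibling::tr[position() mod 2 = 1]/td[3]'
);

// Collapse runs of whitespace and trim each value.
$values = [];
foreach ($cells as $cell) {
    $values[] = trim(preg_replace('/\s+/', ' ', $cell->nodeValue));
}

foreach ($values as $i => $value) {
    echo ($i + 1), ': ', $value, PHP_EOL;
}
```

Looping over the node list avoids hard-coding four separate queries, though it still depends on the alternating row layout assumed above.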


$domxpath->query - table contents
