Correct XPath for an HTML element?

Time: 2017-10-25 19:36:49

Tags: php html xpath web-scraping

I need to scrape this HTML page...

http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3


...using PHP and XPath, to get the value 7 shown next to the string "CODICE GIALLO".

(Note: if you browse to that page you may see a different value. That doesn't matter; the value changes dynamically.)

I'm using this PHP code sample to print the value...

<?php
    ini_set('display_errors', 'On');
    error_reporting(E_ALL);

    $url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';

    $xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[2]/table/tbody/tr[4]/td[2]/table/tbody/tr[2]/td[2]/b';

    //#Set CURL parameters: pay attention to the PROXY config !!!!
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, '');
    $data = curl_exec($ch);
    curl_close($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($data);

    $xpath = new DOMXPath($dom);

    $colorWaitingNumber = $xpath->query($xpath_for_parsing);
    $theValue =  'N.D.';
    foreach( $colorWaitingNumber as $node )
    {
      $theValue = $node->nodeValue;
    }

    print $theValue;
?>

This way I get "N.D." as output, not "7" as I expected.

After reading "Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?", I found that the problem is related to the <tbody> tag, so I removed it from my original XPath and tried the following:

$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tr[1]/td/table/tr[3]/td[2]/table/tr[4]/td[2]/table/tr[2]/td[2]/b';

But the result is still "N.D." instead of "7".
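As a small illustration of the <tbody> point above: browsers add <tbody> to the DOM they display, while PHP's DOMDocument keeps the markup as it was served. The sketch below uses a made-up, minimal table (not the real page) and counts the matches for a path with and without tbody.

```php
<?php
// Hypothetical markup, served without <tbody>; this is NOT the real page.
$html = '<html><body><table><tr><td><b>7</b></td></tr></table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Path as a browser inspector would typically report it (tbody included).
$withTbody = $xpath->query('/html/body/table/tbody/tr/td/b');
// Same path with the tbody step removed.
$withoutTbody = $xpath->query('/html/body/table/tr/td/b');

print "with tbody: " . $withTbody->length . "\n";
print "without tbody: " . $withoutTbody->length . "\n";
```

Only the path without tbody should match here, which is why an absolute path copied from a browser often needs adjusting. Long absolute paths can still miss for other reasons, so this does not explain on its own why the shortened path above fails.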

Using

$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tr[1]/td/table/tr[3]/td[2]/table/tr[4]/td[2]/table';

the result is "Codice GIALLO 7".

How can I get only the "7" value?

Any suggestions or examples?

1 Answer:

Answer 0 (score: 1)

This should do the trick:

//td[.="Codice GIALLO"]/following-sibling::td/b
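A minimal sketch of how this expression could be used with DOMXPath, assuming the label cell contains exactly the text "Codice GIALLO" and the number sits in a <b> inside the next cell. The table below is invented for illustration and is not copied from the real page.

```php
<?php
// Hypothetical markup: a guess at the shape of the relevant rows.
$html = <<<HTML
<html><body>
<table>
  <tr><td>Codice ROSSO</td><td><b>1</b></td></tr>
  <tr><td>Codice GIALLO</td><td><b>7</b></td></tr>
</table>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Find the cell by its text, then take the <b> in the cell(s) that follow it.
$nodes = $xpath->query('//td[.="Codice GIALLO"]/following-sibling::td/b');

$theValue = 'N.D.';                 // same fallback as in the question
if ($nodes !== false && $nodes->length > 0) {
    $theValue = trim($nodes->item(0)->nodeValue);
}

print $theValue;
```

Because the expression anchors on the label text rather than on the position of each nested table, it does not depend on tbody or on the exact nesting. If the real cell carries surrounding whitespace, a variant such as //td[normalize-space(.)="Codice GIALLO"]/following-sibling::td/b may be needed; that is an assumption about the page, not something confirmed here.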