PHP SimpleHtmlDom xpath

Asked: 2013-12-21 07:33:11

Tags: php xpath simple-html-dom

I am trying to get the content of a node in a web page I am parsing. Here is my code:

include('simplehtmldom_1_5/simple_html_dom.php');
// get DOM from URL or file
$feedUrl = "http://www.yellowpages.com/md/cpa-tax?menu_search=false&page=1&refinements%5Bfacet_clicked%5D=HeadingText&refinements%5Bheadingtext%5D%5B%5D=Accountants-Certified+Public&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation-Business";
$html = file_get_html($feedUrl);
$xpath = "/html/body/div[5]/div[1]/div[1]/div[1]/div[5]/div[3]/div[1]/div[1]/div[1]/div[1]/a[1]/div[1]/div[1]/div[3]/div[1]/div[2]/h3[1]/div[1]/a[1]";
foreach($html->find($xpath) as $e) 
    echo $e->title . '<br>';

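In this example, I am trying to get the name "Tax Experience CPA, Inc" from the web page. The problem is that the array returned by find($xpath) is always empty. When I open Google Chrome and search for the node with that XPath, I can find the node I want, but the same path does not work in my code. There must be something wrong with the path I am using, but I cannot figure out what. I have searched and searched without finding what I am doing wrong. Please help.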

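2 Answers:

Answer 0: (score: 1)

The site has plenty of nodes with ids and classes; use them to build a much simpler XPath expression that retrieves what you want!

Here is working code for you: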

// includes Simple HTML DOM Parser
include "simple_html_dom.php";

$feedUrl = "http://www.yellowpages.com/md/cpa-tax?menu_search=false&page=1&refinements%5Bfacet_clicked%5D=HeadingText&refinements%5Bheadingtext%5D%5B%5D=Accountants-Certified+Public&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation-Business";

//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL or file
$html->load_file($feedUrl);

// Find all anchors
$anchors = $html->find("//div[@class='srp-business-name']/a");

// Display all titles
foreach($anchors as $a) 
    echo $a->title . '<br>';

Output

Tax Experience CPA Inc
Bernice Hassan CPA Accounting & Tax Services
Begosh Tax Service CPA
At-Home CPA Tax Service
CPA Financial & Tax Service
My Tax CPA
...
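
For comparison, here is a minimal sketch of the same class-based lookup using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. As far as I know, Simple HTML DOM's find() interprets CSS-like selectors rather than full XPath, which would explain why positional steps such as div[5] copied from the browser return nothing, while DOMXPath evaluates real XPath. The inline markup, names, and phone numbers below are hypothetical stand-ins for the real page, chosen only so the snippet runs without a network request.

```php
<?php
// Hypothetical markup that mimics the listing structure; not the real page.
$markup = <<<HTML
<html><body>
<div class="business-container-inner">
  <div class="srp-business-name"><a href="#" title="Tax Experience CPA Inc">Tax Experience CPA Inc</a></div>
  <span class="business-phone">(555) 010-0001</span>
</div>
<div class="business-container-inner">
  <div class="srp-business-name"><a href="#" title="My Tax CPA">My Tax CPA</a></div>
  <span class="business-phone">(555) 010-0002</span>
</div>
</body></html>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Class-based expression: keeps working if unrelated wrappers are added around the listings.
$titles = [];
foreach ($xpath->query("//div[@class='srp-business-name']/a") as $a) {
    $titles[] = $a->getAttribute('title');
}

// Positional expression: valid XPath here, but tied to the exact nesting of the page.
$first = $xpath->query("/html/body/div[1]/div[1]/a[1]");

echo implode("\n", $titles), "\n";
echo "positional matches: ", $first->length, "\n";
```

A positional path breaks as soon as the site adds or removes a wrapper element, so anchoring on a class or id is usually the more durable choice in either library.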


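Edit:

Here is a modified version that gets the title and phone number from each listing element (div).

Note that find("...", $index) returns the single element specified by $index (the Nth element, counting from 0); if $index is not set, it returns an array of elements.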

$feedUrl = "http://www.yellowpages.com/md/cpa-tax?menu_search=false&page=1&refinements%5Bfacet_clicked%5D=HeadingText&refinements%5Bheadingtext%5D%5B%5D=Accountants-Certified+Public&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation-Business";

//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL or file
$html->load_file($feedUrl);

// Find all elements
$divs = $html->find('div.business-container-inner');

// loop through all elements and display the useful parts
foreach($divs as $div) {
    $title = $div->find('div.srp-business-name a', 0)->title;

    $phone = $div->find('span.business-phone', 0)->plaintext;

    echo $title ." - ". $phone . "<br>";
}

// Clear DOM object
$html->clear();
unset($html);
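
The two return shapes of find() noted above can be seen on a small inline document. This is only a sketch: it assumes simple_html_dom.php (version 1.5) sits next to the script, and the markup is hypothetical, with the second listing deliberately missing its phone number. Because find() with an index returns null when nothing matches, the sketch checks the result before reading a property, which the loop above does not do.

```php
<?php
// Assumption: simple_html_dom.php (1.5) is in the same directory as this script.
include 'simple_html_dom.php';

// Hypothetical inline markup; the second listing has no phone element.
$html = str_get_html(
    '<div class="business-container-inner">'
  . '<div class="srp-business-name"><a title="My Tax CPA">My Tax CPA</a></div>'
  . '<span class="business-phone">(555) 010-0002</span></div>'
  . '<div class="business-container-inner">'
  . '<div class="srp-business-name"><a title="At-Home CPA Tax Service">At-Home CPA Tax Service</a></div></div>'
);

$rows = [];
// Without an index, find() returns an array of all matching elements.
foreach ($html->find('div.business-container-inner') as $div) {
    // With an index, find() returns a single element, or null if there is no match.
    $link  = $div->find('div.srp-business-name a', 0);
    $phone = $div->find('span.business-phone', 0);

    $rows[] = ($link ? $link->title : '(no title)') . ' - '
            . ($phone ? trim($phone->plaintext) : '(no phone)');
}
echo implode("\n", $rows), "\n";

// Release the parsed DOM.
$html->clear();
```

Without the null checks, a listing that lacks one of the elements would raise an error when ->title or ->plaintext is read from null.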


Answer 1: (score: 0)

I think you should try this.

include('simplehtmldom_1_5/simple_html_dom.php');

// get DOM from URL or file
$feedUrl = "http://www.yellowpages.com/md/cpa-tax?menu_search=false&page=1&refinements%5Bfacet_clicked%5D=HeadingText&refinements%5Bheadingtext%5D%5B%5D=Accountants-Certified+Public&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation-Business";

$html = new simple_html_dom();
$html->load_file($feedUrl);
$xpath = ".srp-business-name a";
foreach($html->find($xpath) as $e) 
    echo $e->title . '<br>';