PHP DOMDocument: isolating data from HTML

Asked: 2012-05-10 15:40:50

Tags: php html parsing domdocument

I have tried several approaches without much success. My HTML looks like this:

<td>
  <a href="..?ID=343">
    <img src=".." />
  </a>
</td>
<td>
 <a href="..?id=343">  <!-- the only difference between the two links is that this one has "id" in lowercase -->
  Some text..
 </a>
</td>

Now I want to get this second <a> element and its content: Some text..

I managed to get both pieces of information, but for some reason, when I print links_array, every link shows up twice:

  

Array (
    [0] => http://www.....net/2004/dealer_oglas.asp?id=5895417
    [1] => http://www.....net/2004/dealer_oglas.asp?ID=5895417
    [2] => http://www.....net/2004/dealer_oglas.asp?id=5883006
    [3] => http://www.....net/2004/dealer_oglas.asp?ID=5883006
    [4]

$ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, "http://www.....net/2004/dealer_Zaloga.asp?dealer=12321");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    $output = curl_exec($ch);

    $dom = new DOMDocument;
    @$dom->loadHTML($output);




    // Get images
    $images = $dom->getElementsByTagName('img');
    $image_array = array();

    for($i = 0; $i < $images->length; $i++) {
        if($images->item($i)->getAttribute('width') == "80") {
            array_push($image_array, $dom->saveHTML($images->item($i)));
        }
    }

    // Get links
    $links = $dom->getElementsByTagName('a');
    $links_array = array();
    $title_array = array();

    // Here I try to compare the two <a> elements that are found. I want to store only the one
    // that does not have an <img> element right after it, but for some reason both are stored.

    // All arrays (images, links, titles) are expected to be the same size
    for($j = 0; $j < $links->length; $j++) {
        if(isset($image_array[$j]) && $dom->saveHTML($links->item($j+1)) != $image_array[$j]) {
            array_push($links_array, 'http://www.....net/2004/' . $links->item($j)->getAttribute('href'));
            array_push($title_array, $links->item($j)->nodeValue);
        }
    }

I also tried comparing nodeValue to check whether it is "" or " ", but without success. Thanks in advance for any help.

1 Answer:

Answer 0: (score: 0)

Maybe the links really are in the page twice?

$dom->getElementsByTagName('a') searches the entire document, so it returns every <a> element, including the ones that wrap an image.
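
One possible way to keep only the text links is to let XPath do the filtering instead of comparing serialized nodes by index. The sketch below is not the asker's code: the sample markup is a hypothetical reduction of the snippet in the question (the `a.jpg` image name and the surrounding table are assumptions), and `//a[not(.//img)]` is just one expression that selects anchors without an image inside them.

```php
<?php
// Hypothetical sample modeled on the question's markup: one row with two
// links to the same item, one wrapping an image and one wrapping text.
$html = <<<HTML
<table><tr>
<td><a href="dealer_oglas.asp?ID=343"><img src="a.jpg" width="80" /></a></td>
<td><a href="dealer_oglas.asp?id=343">Some text..</a></td>
</tr></table>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName('a') walks the whole document, so it sees both links.
$allLinks = $dom->getElementsByTagName('a');

// XPath: select only the <a> elements that have no <img> descendant.
$xpath = new DOMXPath($dom);
$textLinks = $xpath->query('//a[not(.//img)]');

$links_array = array();
$title_array = array();
foreach ($textLinks as $a) {
    $links_array[] = $a->getAttribute('href');
    $title_array[] = trim($a->nodeValue);
}

echo $allLinks->length, " anchors in total\n";
print_r($links_array);
print_r($title_array);
```

With this sample, `$allLinks` contains both anchors while `$links_array` and `$title_array` each hold a single entry for the text link. On the real page the base URL from the question would still need to be prepended to each `href`.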