Scraping an image from HTML page source with PHP

Time: 2015-02-21 13:58:57

Tags: php html web-scraping

I have a function that scrapes images from an HTML page. This is the HTML source I want to scrape:

<div class="single-post-thumb">
        <img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" />      </div>

This is my scraping function:

public function process_individual_links($news_coll)
{
    // Reverse the collection before processing.
    $news_coll = array_reverse($news_coll);
    //print_r($news_coll);
    foreach ($news_coll as $news)
    {
        $news_url = $news["news_url"];
        $preview = $this->_http->request($news_url);
        $preview = $this->stripNewLine($preview);
        // Group 1 is the thumbnail URL, group 2 is the entry body.
        $expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';
        preg_match_all($expr, $preview, $matches);
        $count = count($matches[0]);
        if ($count == 0)
        {
            // No thumbnail matched: fall back to the entry body only.
            $expr = '#<div class="entry">(.*?)</div><!-- .entry /-->#';
            preg_match_all($expr, $preview, $matches);
            $news["news_images"] = "";
            $news["news_content"] = isset($matches[1][0]) ? $matches[1][0] : "";
        }
        else
        {
            $news["news_images"] = str_replace('"', "", $matches[1][0]);
            $news["news_content"] = $matches[2][0];
            echo " " . $news["news_images"] . " ";
        }
        $imager = $news["news_images"];
        $news["news_content"] = $news["news_content"] . "<p><a href='" . $news_url . "'>Sumber Berita</a></p>" . $imager;
        if ($this->insertIntoWordpress($news, "TNI") == "-1")
            echo " ";
        else
            echo "Fetching Content - " . $news["news_url"] . $news["news_images"];
    }
}

I tried this on other websites where the markup looks like <img src="">, with no height and width attributes before src, and it works there.
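Because the result seems to depend on which attributes come before src, one way to avoid that dependency is to parse the markup instead of pattern-matching it. The following is only a minimal sketch, assuming PHP's bundled DOM extension is available; the variable names ($html, $doc, $xpath, $imageUrl) are illustrative and are not part of the function above.

```php
<?php
// Sample markup copied from the question, including its original whitespace.
$html = '<div class="single-post-thumb">
        <img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" />      </div>';

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is often not perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the src attribute of any <img> inside the thumbnail div.
// Attribute order and whitespace between tags do not matter here.
$nodes = $xpath->query('//div[@class="single-post-thumb"]//img/@src');

// Take the first match, or null when the page has no thumbnail.
$imageUrl = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $imageUrl, PHP_EOL;
```

In the function above, $preview would take the place of $html, and the null case would correspond to the branch where no thumbnail is found.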

This is the expression I use for scraping:

$expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';
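The expression above requires `<img` to follow the opening div directly and `</div>` to follow `/>` directly, while the sample HTML has whitespace in both places. Whether that whitespace survives stripNewLine is not shown, so the following is only a sketch under the assumption that it does. It covers the thumbnail part of the pattern only, and the names $html, $m and $imageUrl are illustrative.

```php
<?php
// Sample markup copied from the question, including its original whitespace.
$html = '<div class="single-post-thumb">
        <img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" />      </div>';

// \s*       tolerates whitespace between the tags
// [^>]*?    allows any attributes before src without leaving the <img> tag
// \ssrc=    matches src whether it is the first attribute or a later one
// ([^"]*)   captures the URL up to the closing quote
$expr = '#<div class="single-post-thumb">\s*<img\b[^>]*?\ssrc="([^"]*)"[^>]*>\s*</div>#';

$imageUrl = null;
if (preg_match($expr, $html, $m)) {
    $imageUrl = $m[1];
}

echo $imageUrl, PHP_EOL;
```

The same pattern also matches a tag such as `<img src="...">` where src comes first, since `[^>]*?` may match nothing.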

0 answers:

No answers