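Using Simple HTML DOM and SimpleXML

Time: 2015-05-15 13:34:38

Tags: php web-scraping simplexml simple-html-dom

I am trying to scrape data from several links that I retrieve from an XML file. However, I keep getting an error that seems to occur only for some of the news items. Below you can see the output I get: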

http://www.hltv.org/news/14971-rgn-pro-series-groups-drawnRGN Pro Series groups drawn

http://www.hltv.org/news/14969-k1ck-reveal-new-teamk1ck reveal new team

http://www.hltv.org/news/14968-world-championships-captains-unveiled
Fatal error: Call to a member function find() on a non-object in  /app/scrape.php on line 266

This is line 266:

$hltv_full_text = $hltv_deep_link->find("//div[@class='rNewsContent']", 0);

Full code

Scrape function

function scrape_hltv() {
    $hltv = "http://www.hltv.org/news.rss.php";
    $sxml = simplexml_load_file($hltv);
    global $con;
    foreach($sxml->channel->item as $item)
    {
        $hltv_title = (string)$item->title;
        $hltv_link = (string)$item->link;
        $hltv_date = date('Y-m-d H:i:s', strtotime((string)$item->pubDate));
        echo $hltv_link;

        //if (date('Y-m-d', strtotime((string)$item->pubDate)) ==  date('Y-m-d')){
            if (strpos($hltv_title,'Video:') === false) {
                $hltv_deep_link = file_get_html($hltv_link);
                $hltv_full_text = $hltv_deep_link->find("//div[@class='rNewsContent']", 0);


                echo $hltv_title . '<br><br>';

            }
        //}


    }

}

scrape_hltv();

1 Answer:

Answer 0 (score: 1)

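file_get_html() returns false at times. When it does, $hltv_deep_link is not an object, so the call to find() on line 266 fails with the fatal error shown above.

See the source code here: http://sourceforge.net/p/simplehtmldom/code/HEAD/tree/trunk/simple_html_dom.php#l79

It returns false for the link

http://www.hltv.org/news/14968-world-championships-captains-unveiled

I think this is because the page content is larger than MAX_FILE_SIZE (600 000 bytes). The function contains the check if (empty($contents) || strlen($contents) > MAX_FILE_SIZE) { return false; }, and the page size is actually about 3 MB.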
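The size check described above can be reproduced on its own to see which pages would be rejected. This is only a minimal sketch: the 600 000-byte value is taken from the text above rather than read from the library, and the names ASSUMED_MAX_FILE_SIZE and would_be_rejected() are made up for illustration.

```php
<?php
// Assumed limit: the 600 000 bytes mentioned above (not read from the library).
const ASSUMED_MAX_FILE_SIZE = 600000;

// Hypothetical helper: mirrors the check quoted above and returns true
// when the content is empty (including false) or longer than the limit.
function would_be_rejected($contents, int $limit = ASSUMED_MAX_FILE_SIZE): bool
{
    return empty($contents) || strlen($contents) > $limit;
}

// Usage: fetch the raw page and report whether its size is the problem.
// $raw = file_get_contents('http://www.hltv.org/news/14968-world-championships-captains-unveiled');
// var_dump(would_be_rejected($raw));
```

If this reports true for a link whose download succeeded, the size limit is the likely cause rather than a network failure.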

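If you also want to handle larger files, you can try a modified version of that function, in which the MAX_FILE_SIZE check has been removed:

...
define('DEFAULT_TARGET_CHARSET', 'UTF-8');
define('DEFAULT_BR_TEXT', "\r\n");
define('DEFAULT_SPAN_TEXT', " ");

// Same as file_get_html(), but without the strlen($contents) > MAX_FILE_SIZE check.
// $offset defaults to 0 here instead of the library's original -1, because
// PHP 7.1+ treats a negative offset as counting from the end of the stream.
function file_get_html_modified($url, $use_include_path = false, $context = null, $offset = 0, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
{
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Only an empty or failed download is rejected now; large pages are parsed.
    if (empty($contents))
    {
        return false;
    }
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}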
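The modified function still returns false when the download is empty or fails, so the caller should check the result before calling find(). A minimal sketch of such a guard follows; find_first_or_null() is a hypothetical helper name, not part of Simple HTML DOM, and the selector in the usage comment is written in CSS style as an assumption about what the question intends to match.

```php
<?php
// Hypothetical helper: returns the first element matching $selector,
// or null when the loader returned false (or anything that is not an object).
function find_first_or_null($dom, string $selector)
{
    if (!is_object($dom)) {
        return null;
    }
    return $dom->find($selector, 0);
}

// Usage inside the foreach loop of scrape_hltv():
// $hltv_deep_link = file_get_html_modified($hltv_link);
// $hltv_full_text = find_first_or_null($hltv_deep_link, "div.rNewsContent");
// if ($hltv_full_text === null) {
//     continue; // skip this item instead of hitting a fatal error
// }
```

Skipping the item keeps the loop running for the remaining links, which seems preferable to aborting the whole scrape on one bad page.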