Need help improving this crawler

Time: 2014-11-04 08:38:52

Tags: php

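I have this crawler, but it only indexes the domain root (e.g. mydomain.com) and not pages such as /somethingelse.php or /otherpage.html. I mean the site's internal links.

Is there a way to modify this script so that it indexes more pages than just the root?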

    <?php
require_once('./pathtoconfig');
require_once('./functions.php');
set_time_limit(500);
error_reporting(-1);    
header('Content-Type: text/plain; charset=utf-8;');

$db = @mysqli_connect($conf['host'], $conf['user'], $conf['pass'], $conf['name']);

if(!$db) {
    // Stop here: every query below needs a working connection.
    exit("Failed to connect to MySQL: (" . mysqli_connect_errno() . ") " . mysqli_connect_error());
}
mysqli_query($db, 'SET NAMES utf8');

//Insert links separated by commas.

$url = array('mydomain1.com', 'mydomain2.com');                         
foreach($url as $k) {       
    $url = parse_url($k);   
    if(!isset($url['path'])) {
        $selectData = "SELECT * FROM web WHERE url = '$k'";
        if(mysqli_fetch_row(mysqli_query($db, $selectData)) == null) {
            $content = getUrl($k);
            preg_match('#<title>(.*)</title>#i', $content, $title);
            preg_match_all('/<img src=.([^"\' ]+)/', $content, $img);
            preg_match('/<head>.+<meta name="description" content=.([^"\']+)/is', $content, $description);
            preg_match('/<head>.+<meta name="author" content=.([^"\']+)/is', $content, $author);
            #preg_match_all('/href=.([^"\' ]+)/i', $content, $anchor);
            preg_match('/<body.*?>(.*?)<\/body>/is', $content, $body);
            if(!empty($title[1]) AND !empty($description[1]) || !empty($body[1])) {
                echo 'Title: '; @print_r($title[1]);
                echo "\n";  
                $body_trim = trim(preg_replace("/&#?[a-z0-9]+;/i",'',(strip_tags(@$body[0])))); $bodyContent = substr(preg_replace('/\s+/', ' ', $body_trim), 0, 255);

                $description_trim = trim(preg_replace("/&#?[a-z0-9]+;/i",'',(strip_tags(@$description[1])))); $descContent = substr(preg_replace('/\s+/', ' ',$description_trim), 0, 255);

                $bodyContent = str_replace('\'', '', $bodyContent);
                $descContent = str_replace('\'', '', $descContent);
                echo 'Description: '; @print_r($descContent);
                echo "\n";
                echo 'Author: '; @print_r($author[1]);
                echo "\n";
                echo 'URL: '; @print_r($k); $date = date("d M Y");
                echo "\n";
                echo "\n---------------------------------------------------------------------------\n";
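                // Escape every value so quotes in the page content cannot break (or inject into) the query.
                $values = array_map(function ($v) use ($db) { return mysqli_real_escape_string($db, (string) $v); }, array($k, @$title[1], $descContent, $bodyContent, @$author[1], $date));
                $insertData = "INSERT INTO `web` (`url`, `title`, `description`, `body`, `author`, `date`) VALUES ('" . implode("', '", $values) . "')";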
                #echo $insertData;
                mysqli_query($db, $insertData);
            }
        }
    }
}
?>

I hope you can help me. Thank you very much.

2 answers:

Answer 0 (score: 0)

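Don't use regular expressions to parse HTML. Use DOMDocument instead; it makes finding every link easy. Here is a function I found with a quick Google search, and you can see how simple it is!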

/**
 * @author Jay Gilford
 */

/**
 * get_links()
 * 
 * @param string $url
 * @return array
 */
function get_links($url) {

    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the URL's contents into the DOM (the @ suppresses warnings from invalid HTML)
    @$xml->loadHTMLFile($url);

    // Empty array to hold all links to return
    $links = array();

    // Loop through each <a> element in the DOM and add it to the links array
    foreach($xml->getElementsByTagName('a') as $link) {
        $links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
    }

    //Return the links
    return $links;
}
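
To get from "all links" to "internal pages worth indexing", the hrefs still need to be filtered and turned into absolute URLs. Here is a rough sketch of one way to do that, not a complete URL resolver. The function name `extract_internal_links()` and the sample HTML are made up for illustration, and it works on an HTML string (for example whatever the asker's `getUrl()` returns) so it can be tried without a network call. It ignores `../` segments, protocol-relative links and `<base>` tags.

```php
<?php
// Hypothetical helper (not from the original answer): return the unique
// same-host links found in $html, as absolute URLs without fragments.
function extract_internal_links(string $html, string $baseUrl): array
{
    $base = parse_url($baseUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return array(); // the base URL must be absolute, e.g. http://mydomain.com/
    }
    $root = $base['scheme'] . '://' . $base['host'];
    $path = isset($base['path']) ? $base['path'] : '/';
    // Directory part of the base path, used for relative links like "page.html"
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    $dom = new DOMDocument();
    // The @ silences warnings about malformed markup, as in get_links() above
    if ($html === '' || !@$dom->loadHTML($html)) {
        return array();
    }

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // nothing to follow, or an in-page anchor
        }
        if (preg_match('#^https?://#i', $href)) {
            // Absolute link: keep it only when it points at the same host
            $host = parse_url($href, PHP_URL_HOST);
            if (!is_string($host) || strcasecmp($host, $base['host']) !== 0) {
                continue;
            }
            $absolute = $href;
        } elseif (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href) || strpos($href, '//') === 0) {
            continue; // mailto:, javascript:, protocol-relative links: skipped in this sketch
        } elseif ($href[0] === '/') {
            $absolute = $root . $href;          // root-relative link
        } elseif ($href[0] === '?') {
            $absolute = $root . $path . $href;  // same page, different query string
        } else {
            $absolute = $root . $dir . $href;   // relative to the current directory
        }
        $absolute = preg_replace('/#.*$/', '', $absolute); // drop any fragment
        $links[$absolute] = true;                          // array keys de-duplicate
    }
    return array_keys($links);
}

// Usage with a made-up page
$html = '<html><body><a href="/somethingelse.php">A</a>'
      . '<a href="otherpage.html#top">B</a>'
      . '<a href="http://other.example/x">C</a></body></html>';
print_r(extract_internal_links($html, 'http://mydomain.com/'));
```

Each URL it returns could then be pushed onto the list the crawler loops over, with the existing `SELECT` check keeping already-indexed pages from being fetched twice.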

Answer 1 (score: 0)

Your crawler doesn't accept URLs that contain path information, because you explicitly check that there is no path:

if(!isset($url['path'])) {

You can remove this test entirely (along with its matching closing }), or change the test to better fit your needs.
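
As one possible "better fit", the test could check the host instead of the path, so that any page on a known domain is accepted. The sketch below is only an illustration: `should_index()` and the `$allowed` list are hypothetical names, and it assumes scheme-less entries such as `mydomain1.com` should be read as `http://`. That last step matters because `parse_url()` reports a bare `mydomain1.com` as a path, not a host.

```php
<?php
// Hypothetical replacement for the `!isset($url['path'])` test: accept any
// HTTP(S) URL whose host is on an allow-list, whatever its path is.
function should_index(string $url, array $allowedHosts): bool
{
    // Assumption: entries without a scheme are meant as plain http:// URLs.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false; // not a usable absolute URL
    }
    if (isset($parts['user'])) {
        return false; // URLs carrying credentials are rejected in this sketch
    }
    if (!in_array(strtolower($parts['scheme']), array('http', 'https'), true)) {
        return false; // only web pages are crawled
    }
    // $allowedHosts is expected to contain lowercase host names
    return in_array(strtolower($parts['host']), $allowedHosts, true);
}

// Usage with the domains from the question plus one foreign URL
$allowed = array('mydomain1.com', 'mydomain2.com');
$candidates = array(
    'mydomain1.com',
    'http://mydomain1.com/somethingelse.php',
    'http://elsewhere.example/',
);
foreach ($candidates as $candidate) {
    echo $candidate, ' => ', should_index($candidate, $allowed) ? 'index' : 'skip', "\n";
}
```

In the question's loop this would mean replacing `if(!isset($url['path']))` with something like `if(should_index($k, $allowed))`.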