Basic web-scraping question: how do I use PHP to create a list of all the pages on a website?

Date: 2009-09-27 05:18:49

Tags: php web-crawler

I want to build a crawler in PHP that will give me a list of all the pages on a given domain (starting from the homepage: www.example.com).

How can I do this in PHP?

I don't know how to recursively find all of a site's pages, starting from a given page and excluding external links.

2 answers:

Answer 0 (score: 3)

For the general approach, take a look at the answers to these questions:

In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You can do a naive parse of the HTML with a regular expression and preg_match() to find <a href=""> tags and pull the URLs out of them (see this question for some typical approaches).
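
A minimal sketch of that naive approach (the exact regex here is just one illustrative choice, and like any regex it will stumble over unusual markup):

$content = file_get_contents('http://www.example.com/');

// Naive href extraction; quick, but not a real HTML parser.
preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $content, $matches);

foreach ($matches[1] as $href) {
    echo $href, "\n";
}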

Once you've extracted the raw href attribute, you can use parse_url() to break it into components and decide whether it's a URL you want to fetch; keep in mind that URLs may be relative to the page you retrieved them from.
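
For example, a minimal same-domain check might look like the following (shouldCrawl() is a hypothetical helper; fully resolving a relative URL against the current page takes more care than is shown here):

function shouldCrawl($href, $baseHost) {
    $parts = parse_url($href);

    // Skip non-HTTP schemes such as mailto: or javascript:.
    if (isset($parts['scheme']) && !in_array($parts['scheme'], array('http', 'https'))) {
        return false;
    }

    // No host component means a relative URL, so it belongs to this site.
    if (!isset($parts['host'])) {
        return true;
    }

    // Absolute URL: keep it only if it points at the same host.
    return strcasecmp($parts['host'], $baseHost) === 0;
}

// shouldCrawl('/about.html', 'www.example.com')              -> true
// shouldCrawl('http://www.example.com/x', 'www.example.com') -> true
// shouldCrawl('http://other.com/', 'www.example.com')        -> false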

Although it's fast, a regex is not the best way to parse HTML. You can also try the DOM classes to parse the HTML you've fetched, for example:

$dom = new DOMDocument();
$dom->loadHTML($content);

$anchors = $dom->getElementsByTagName('a');

if ( $anchors->length > 0 ) {
    foreach ( $anchors as $anchor ) {
        if ( $anchor->hasAttribute('href') ) {
            $url = $anchor->getAttribute('href');

            // now figure out whether to process this
            // URL and add it to a list of URLs to be fetched
        }
    }
}
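
To turn either parsing approach into the recursive site walk the question asks for, the usual pattern is a queue of pending URLs plus a set of visited ones, rather than literal recursion. A rough sketch, where extractLinks() is a hypothetical wrapper around the DOM code above that returns absolute, same-domain URLs:

$queue   = array('http://www.example.com/');
$visited = array();

while (!empty($queue)) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    $content = @file_get_contents($url);
    if ($content === false) {
        continue; // fetch failed; skip this page
    }

    // extractLinks() is assumed to filter out external links.
    foreach (extractLinks($content, $url) as $link) {
        if (!isset($visited[$link])) {
            $queue[] = $link;
        }
    }
}

// $visited now holds every page reachable from the start URL.
print_r(array_keys($visited));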

Finally, rather than writing this yourself, also see this question for other resources you can use.

Answer 1 (score: 0)

Overview

Here are some notes on the basics of the crawler.

It is a console app - it doesn't need a rich interface, so I figured a console application would do. The output is written as an HTML file and the input (which site to crawl) comes from the app.config. Making a Windows app out of this seemed like overkill.
The crawler is designed to crawl only the site it originally targets. It would be easy to change that if you want to crawl more than a single site, but that is the goal of this little application.
Originally the crawler was written just to find bad links. Just for fun I also had it collect information on page and viewstate sizes. It will also list all non-HTML files and external URLs, in case you care to see them.
The results are shown in a rather minimalistic HTML report, which is opened automatically in Internet Explorer when the crawl finishes.

Getting the Text from an HTML Page

The first key piece of building a crawler is the mechanism for going out and fetching the HTML from the web (or from your local machine, if you're running the site locally). Like so much else, .NET has classes for this built into the framework.

    // Requires: using System.IO; and using System.Net;
    private static string GetWebText(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "A .NET Web Crawler";

        // Dispose the response, stream, and reader so the
        // connection is released back to the pool.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

The HttpWebRequest class can be used to request any page from the internet. The response (retrieved by calling GetResponse()) holds the data you want. Get the response stream, wrap it in a StreamReader, and read the text out to get the HTML.
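
Since the question asks for PHP, here is a rough equivalent of GetWebText() (an untested sketch that sets the same kind of User-Agent header through a stream context):

function getWebText($url) {
    // Mirror the .NET example's custom User-Agent header.
    $context = stream_context_create(array(
        'http' => array('user_agent' => 'A PHP Web Crawler'),
    ));
    return file_get_contents($url, false, $context);
}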