Question

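I'm looking into building a web crawler/spider, but I need someone to point me in the right direction to get started.

Basically, my spider will search for audio files and index them.

I'm just wondering whether anyone has ideas on how I should go about it. I've heard that doing it in PHP would be very slow. I know VB.NET; would that come in handy?

I was thinking of using Google's filetype search to get links to crawl. Would that be okay?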

Answer 1

Here is a link to a tutorial on how to write a web crawler in Java: http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/

I'm sure you can find tutorials for other languages if you Google it.

Answer 2

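In VB.NET you first need to fetch the HTML, so use the WebClient class or the HttpWebRequest and HttpWebResponse classes. There is plenty of information on the internet about how to use them.

Then you need to parse the HTML. I suggest using regular expressions.

Your idea of using Google's filetype search is a very good one. I did something similar a few years ago to collect PDFs for testing PDF indexing in SharePoint, and it worked very well.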
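To make the regular-expression suggestion concrete, here is a minimal sketch of pulling href values out of anchor tags. It is written in Java rather than VB.NET so that it matches the other code in this thread; the class name LinkExtractor and the method extractLinks are made up for illustration. Treat the pattern as a rough approximation: it only handles quoted href attributes, and regular expressions will miss or misread some real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    // Group 1 is the opening quote character, group 2 is the href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the raw href values in the order they appear in the HTML.
    // The values are not resolved or validated here.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```

Calling LinkExtractor.extractLinks(html) on a downloaded page gives you the raw link strings. Relative links such as /b.html still need to be resolved against the page's own URL before you can fetch them.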

Answer 3

The pseudocode would be:

Method spider(URL startURL) {
    Collection URLStore;              // can be an ArrayList
    push(startURL, URLStore);         // start with a known URL
    while URLStore is not empty do
        currURL = pop(URLStore);      // take a URL
        mark currURL as followed;
        download the page at currURL;
        for each link URLx in the page that has not already been followed do
            push(URLx, URLStore);     // queue it for a later visit
}
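
The loop above could be sketched in Java roughly as follows. This is only one possible reading of the pseudocode: the class name Spider, the fetchLinks callback (a stand-in for "download the page and return the links found in it") and the maxPages cap are assumptions added for illustration. The cap is not in the pseudocode; it is there so the loop cannot run forever on a large site.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class Spider {
    // Crawls breadth-first from startUrl and returns the pages visited, in visit order.
    // fetchLinks is a hypothetical callback: given a URL, it returns the URLs linked from that page.
    static Set<String> crawl(String startUrl,
                             Function<String, List<String>> fetchLinks,
                             int maxPages) {
        Deque<String> urlStore = new ArrayDeque<>();   // URLs waiting to be visited
        Set<String> seen = new HashSet<>();            // URLs already queued or visited
        Set<String> visited = new LinkedHashSet<>();   // URLs actually fetched, in order

        urlStore.add(startUrl);
        seen.add(startUrl);

        while (!urlStore.isEmpty() && visited.size() < maxPages) {
            String currUrl = urlStore.poll();          // take a URL
            visited.add(currUrl);
            for (String link : fetchLinks.apply(currUrl)) {
                // Set.add returns false if the link was already seen, so each URL is queued once.
                if (seen.add(link)) {
                    urlStore.add(link);
                }
            }
        }
        return visited;
    }
}
```

Keeping the download step behind a callback means the loop can be tried out on an in-memory map of pages before pointing it at real sites.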

To read data from a web page in Java, you can do something like this:

import java.io.*;
import java.net.URL;

URL myURL = new URL("http://www.w3.org");
// try-with-resources closes the reader even if an exception is thrown
try (BufferedReader in = new BufferedReader(new InputStreamReader(myURL.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) { // reads the page content line by line
        System.out.println(inputLine);            // here you need to extract the hyperlinks
    }
}
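
Once the hyperlinks have been extracted, relative ones have to be resolved against the URL of the page they came from, and a spider that indexes audio files needs some way to recognize them. Below is a rough sketch using java.net.URI. The class name AudioLinks, its two methods and the list of extensions are assumptions chosen for illustration. Checking the file extension is only a heuristic, since a server can serve audio from any path.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;

class AudioLinks {
    // Assumed list of extensions; adjust it to the formats you care about.
    private static final String[] AUDIO_EXTENSIONS = {".mp3", ".ogg", ".wav", ".flac", ".m4a"};

    // Resolves href (absolute or relative) against the URL of the page it was found on.
    // Returns an empty Optional if either string is not a syntactically valid URI.
    static Optional<URI> resolve(String pageUrl, String href) {
        try {
            return Optional.of(URI.create(pageUrl).resolve(href.trim()));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    // True if the path part of the URI (ignoring any query string) ends with a known audio extension.
    static boolean isAudio(URI uri) {
        String path = uri.getPath();   // null for opaque URIs such as mailto:
        if (path == null) {
            return false;
        }
        String lower = path.toLowerCase(Locale.ROOT);
        for (String ext : AUDIO_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, resolving "tracks/Song.MP3?dl=1" against "http://example.com/music/index.html" gives a URI whose path ends in ".MP3", so isAudio accepts it, while an href containing a raw space is rejected by resolve.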

Making a web crawler/spider
