I need to read a large number of web pages. This is the method I actually use to fetch a remote page. Note that the current code works 100%.
static private GetWebPageResult getWebPage(PageNode pagenode)
{
    String result;
    String inputLine;
    URI url;
    int cicliLettura = 0;
    long startTime = 0, endTime, openConnTime = 0, connTime = 0, readTime = 0;
    try
    {
        startTime = System.nanoTime();
        result = "";
        url = pagenode.getUri(); // TODO: do something if getUri() returns null
        if (Core.logGetWebPage())
            openConnTime = System.nanoTime();
        if (url != null)
        {
            HttpURLConnection yc = (HttpURLConnection) url.toURL().openConnection(); // check yc
            if (url.toURL().getProtocol().equalsIgnoreCase("https"))
                yc = (HttpsURLConnection) yc;
            yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
            yc.connect(); // check the result of connect() => there is none! at most it throws an IOException
            if (checkResponseCode(yc.getResponseCode()) == false)
                return new GetWebPageResult(GetWebPageResult.ERR_BAD_RESPONSE_CODE, yc.getResponseCode());
            if (Core.logGetWebPage())
                connTime = System.nanoTime();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream())); // may throw an IOException
            /* previous version, building the string by concatenation:
            while ((inputLine = in.readLine()) != null)
            {
                result = result + inputLine + "\n";
                cicliLettura++;
            }*/
            StringBuffer buffer = new StringBuffer();
            while ((inputLine = in.readLine()) != null)
            {
                buffer.append(inputLine).append('\n');
                cicliLettura++;
            }
            result = buffer.toString();
            if (Core.logGetWebPage())
                readTime = System.nanoTime();
            in.close();
            yc.disconnect();
            if (Core.logGetWebPage())
            {
                endTime = System.nanoTime();
                // url.toURL() is not null, checked above
                System.out.println(/*result+*/"getWebPage executed in " + (endTime - startTime) / 1000000 + " ms. Size: " + result.length() + " Response Code=" + yc.getResponseCode() + " Protocol=" + url.toURL().getProtocol() + " openConnTime: " + (openConnTime - startTime) / 1000000 + " connTime:" + (connTime - openConnTime) / 1000000 + " readTime:" + (readTime - connTime) / 1000000 + " cicliLettura=" + cicliLettura + " page:" + url.toURL());
            }
            return new GetWebPageResult(result);
        }
        else
            return new GetWebPageResult(GetWebPageResult.ERR_NULL_URI, -2);
    } catch (IOException e) {
        System.out.println("Exception1: " + e.toString());
        e.printStackTrace();
        return new GetWebPageResult(GetWebPageResult.ERR_HTML_IOEXCEPTION, -2);
    } catch (ClassCastException e) {
        System.out.println("Exception2: " + e.toString());
        e.printStackTrace();
        return new GetWebPageResult(GetWebPageResult.ERR_CLASS_CAST_EXC, -2);
    } catch (Exception e) {
        System.out.println("Exception ERR_NOT_LISTED_EXC: " + e.toString());
        return new GetWebPageResult(GetWebPageResult.ERR_NOT_LISTED_EXC, -2);
    }
}
Given that url is not null, let's look more closely at this part of the code:
HttpURLConnection yc = (HttpURLConnection) url.toURL().openConnection(); // check yc
if (url.toURL().getProtocol().equalsIgnoreCase("https"))
    yc = (HttpsURLConnection) yc;
yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
yc.connect(); // check the result of connect() => there is none! at most it throws an IOException
if (checkResponseCode(yc.getResponseCode()) == false)
    return new GetWebPageResult(GetWebPageResult.ERR_BAD_RESPONSE_CODE, yc.getResponseCode());
What is the difference between the openConnection() and connect() methods? In any case, once the connection is open, we start reading the data:
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream())); // may throw an IOException
StringBuffer buffer = new StringBuffer();
while ((inputLine = in.readLine()) != null)
{
    buffer.append(inputLine).append('\n');
    cicliLettura++;
}
result = buffer.toString();
OK, so now I have a BufferedReader I can read from. The problem is that my bandwidth is usually much larger than the bandwidth of the remote machine, so I would like to read from several sources at the same time. A good approach seems to be to start many threads and change the last part of the code to something like this:
while we are not at the end of the stream, is there a complete line to read? If yes, read the line; otherwise sleep a bit. At that point I move on to the next reading thread and do the same thing. Is this right? How can I implement it?
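In code, what I have in mind is roughly the following (just a sketch of the idea, not working code from my project; pollReaders, openReaders and buffers are names invented for this example):

import java.io.BufferedReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// One loop cycles over all the open readers, consumes whatever is ready
// without blocking for long, and backs off when nothing is available.
static void pollReaders(List<BufferedReader> openReaders,
                        Map<BufferedReader, StringBuffer> buffers)
        throws IOException, InterruptedException
{
    while (!openReaders.isEmpty())
    {
        boolean readSomething = false;
        Iterator<BufferedReader> it = openReaders.iterator();
        while (it.hasNext())
        {
            BufferedReader in = it.next();
            // ready() only says that some input is buffered, not that a whole
            // line is available, so readLine() can still block briefly.
            if (in.ready())
            {
                String line = in.readLine();
                if (line == null)
                {
                    in.close();  // end of this stream: drop the reader
                    it.remove();
                }
                else
                {
                    buffers.get(in).append(line).append('\n');
                    readSomething = true;
                }
            }
        }
        if (!readSomething)
            Thread.sleep(20); // nothing was ready on any source: sleep a bit and retry
    }
}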
Answer (score: 1)
This looks like a classic producer/consumer scenario. You can optimize your application by creating the classes described below. If you are not yet familiar with the concept of a BlockingQueue and the producer-consumer problem, I suggest you read this before continuing with my answer/design.
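As a quick self-contained illustration of the pattern (the queue contents and the thread bodies here are placeholders, not taken from your code), a BlockingQueue lets a producer thread hand work to a consumer thread without any explicit locking:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ProducerConsumerDemo
{
    public static void main(String[] args)
    {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Producer: put() would block if the queue were bounded and full.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++)
                    queue.put("item-" + i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: take() blocks until an element is available.
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++)
                    System.out.println("consumed " + queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}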
Now all you need to do is add your PageNode objects to the ProcessingQueue, start the WebPageReader and WebPageProcessor threads, and watch the magic happen. Let me know if you need any clarification. Depending on your requirements, you can start just one WebPageReader thread and one WebPageProcessor thread, or several of each; the design supports both. You can also introduce another thread that adds PageNode objects to the ProcessingQueue, for example by crawling the web or by polling some kind of database for pages to crawl.
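To make the design concrete, here is a minimal sketch of what ProcessingQueue, WebPageReader and WebPageProcessor could look like, built on a BlockingQueue. The class bodies are my illustration, not code from your project: they assume your PageNode, GetWebPageResult and getWebPage() from the question (getWebPage() would need to be made reachable from the reader), and the queue sizes and stop conditions are left for you to adapt:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Holds the pages waiting to be downloaded and the results waiting to be processed.
class ProcessingQueue
{
    private final BlockingQueue<PageNode> pagesToFetch = new LinkedBlockingQueue<>();
    private final BlockingQueue<GetWebPageResult> fetchedPages = new LinkedBlockingQueue<>();

    public void addPage(PageNode node) throws InterruptedException { pagesToFetch.put(node); }
    public PageNode nextPage() throws InterruptedException { return pagesToFetch.take(); }
    public void addResult(GetWebPageResult r) throws InterruptedException { fetchedPages.put(r); }
    public GetWebPageResult nextResult() throws InterruptedException { return fetchedPages.take(); }
}

// Producer side: takes PageNodes from the queue, downloads them and hands the results on.
class WebPageReader implements Runnable
{
    private final ProcessingQueue queue;
    public WebPageReader(ProcessingQueue queue) { this.queue = queue; }

    public void run()
    {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                PageNode node = queue.nextPage();   // blocks until a page is available
                queue.addResult(getWebPage(node));  // your existing download method (make it accessible from here)
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();     // stop cleanly when interrupted
        }
    }
}

// Consumer side: takes downloaded pages and does whatever processing you need.
class WebPageProcessor implements Runnable
{
    private final ProcessingQueue queue;
    public WebPageProcessor(ProcessingQueue queue) { this.queue = queue; }

    public void run()
    {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                GetWebPageResult result = queue.nextResult(); // blocks until a result is available
                // ... parse/store the result here ...
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// Wiring it together (again, just an illustration):
// ProcessingQueue queue = new ProcessingQueue();
// new Thread(new WebPageReader(queue)).start();     // one or more readers
// new Thread(new WebPageProcessor(queue)).start();  // one or more processors
// queue.addPage(somePageNode);

Note that if you give the queues a fixed capacity (for example new LinkedBlockingQueue<>(100)), put() also blocks when the queue is full, so the readers automatically slow down whenever the processors fall behind.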